Page 494 - Discrete Mathematics and Its Applications

P. 494

7.3 Bayes’ Theorem 473

WewilldevelopsomebasicBayesianspamﬁlters.First,supposewehaveaset B ofmessages
known to be spam and a set G of messages known not to be spam. (For example, users could
classify messages as spam when they examine them in their inboxes.) We next identify the
words that occur in B and in G. We count the number of messages in the set containing each
word to ﬁnd n B (w) and n G (w), the number of messages containing the word w in the sets B
and G, respectively. Then, the empirical probability that a spam message contains the word w
is p(w) = n B (w)/|B|, and the empirical probability that a message that is not spam contains
the word w is q(w) = n G (w)/|G|. We note that p(w) and q(w) estimate the probabilities that
an incoming spam message, and an incoming message that is not spam, contain the word w,
respectively.
Now suppose we receive a new e-mail message containing the word w. Let S be the event
that the message is spam. Let E be the event that the message contains the word w. The events S,
that the message is spam, and S, that the message is not spam, partition the set of all messages.
Hence, by Bayes’ theorem, the probability that the message is spam, given that it contains the
word w,is
p(E | S)p(S)
p(S | E) = .
p(E | S)p(S) + p(E | S)p(S)

To apply this formula, we ﬁrst estimate p(S), the probability that an incoming message is
spam, as well as p(S), the probability that the incoming message is not spam. Without prior
knowledge about the likelihood that an incoming message is spam, for simplicity we assume
that the message is equally likely to be spam as it is not to be spam. That is, we assume that
p(S) = p(S) = 1/2. Using this assumption, we ﬁnd that the probability that a message is spam,
given that it contains the word w,is

p(E | S)
p(S | E) = .
p(E | S) + p(E | S)

(Note that if we have some empirical data about the ratio of spam messages to messages that are
not spam, we can change this assumption to produce a better estimate for p(S) and for p(S);
see Exercise 22.)
Next, we estimate p(E | S), the conditional probability that the message contains the
word w given that the message is spam, by p(w). Similarly, we estimate p(E | S), the con-
ditional probability that the message contains the word w, given that the message is not spam,
by q(w). Inserting these estimates for p(E | S) and p(E | S) tells us that p(S | E) can be
estimated by
p(w)
r(w) = ;
p(w) + q(w)
that is, r(w) estimates the probability that the message is spam, given that it contains the
word w.If r(w) is greater than a threshold that we set, such as 0.9, then we classify the message
as spam.

EXAMPLE 3 Suppose that we have found that the word “Rolex” occurs in 250 of 2000 messages known to
be spam and in 5 of 1000 messages known not to be spam. Estimate the probability that an
incoming message containing the word “Rolex” is spam, assuming that it is equally likely that
an incoming message is spam or not spam. If our threshold for rejecting a message as spam
is 0.9, will we reject such messages?

Solution: We use the counts that the word “Rolex” appears in spam messages and messages that
are not spam to ﬁnd that p(Rolex) = 250/2000 = 0.125 and q(Rolex) = 5/1000 = 0.005.

489 490 491 492 493 494 495 496 497 498 499