Page 494 - Discrete Mathematics and Its Applications
P. 494

7.3 Bayes’ Theorem  473


                                                        WewilldevelopsomebasicBayesianspamfilters.First,supposewehaveaset B ofmessages
                                                     known to be spam and a set G of messages known not to be spam. (For example, users could
                                                     classify messages as spam when they examine them in their inboxes.) We next identify the
                                                     words that occur in B and in G. We count the number of messages in the set containing each
                                                     word to find n B (w) and n G (w), the number of messages containing the word w in the sets B
                                                     and G, respectively. Then, the empirical probability that a spam message contains the word w
                                                     is p(w) = n B (w)/|B|, and the empirical probability that a message that is not spam contains
                                                     the word w is q(w) = n G (w)/|G|. We note that p(w) and q(w) estimate the probabilities that
                                                     an incoming spam message, and an incoming message that is not spam, contain the word w,
                                                     respectively.
                                                        Now suppose we receive a new e-mail message containing the word w. Let S be the event
                                                     that the message is spam. Let E be the event that the message contains the word w. The events S,
                                                     that the message is spam, and S, that the message is not spam, partition the set of all messages.
                                                     Hence, by Bayes’ theorem, the probability that the message is spam, given that it contains the
                                                     word w,is
                                                                          p(E | S)p(S)
                                                        p(S | E) =                           .
                                                                   p(E | S)p(S) + p(E | S)p(S)

                                                        To apply this formula, we first estimate p(S), the probability that an incoming message is
                                                     spam, as well as p(S), the probability that the incoming message is not spam. Without prior
                                                     knowledge about the likelihood that an incoming message is spam, for simplicity we assume
                                                     that the message is equally likely to be spam as it is not to be spam. That is, we assume that
                                                     p(S) = p(S) = 1/2. Using this assumption, we find that the probability that a message is spam,
                                                     given that it contains the word w,is

                                                                        p(E | S)
                                                        p(S | E) =                   .
                                                                   p(E | S) + p(E | S)

                                                     (Note that if we have some empirical data about the ratio of spam messages to messages that are
                                                     not spam, we can change this assumption to produce a better estimate for p(S) and for p(S);
                                                     see Exercise 22.)
                                                        Next, we estimate p(E | S), the conditional probability that the message contains the
                                                     word w given that the message is spam, by p(w). Similarly, we estimate p(E | S), the con-
                                                     ditional probability that the message contains the word w, given that the message is not spam,
                                                     by q(w). Inserting these estimates for p(E | S) and p(E | S) tells us that p(S | E) can be
                                                     estimated by
                                                                   p(w)
                                                        r(w) =            ;
                                                               p(w) + q(w)
                                                     that is, r(w) estimates the probability that the message is spam, given that it contains the
                                                     word w.If r(w) is greater than a threshold that we set, such as 0.9, then we classify the message
                                                     as spam.

                                      EXAMPLE 3      Suppose that we have found that the word “Rolex” occurs in 250 of 2000 messages known to
                                                     be spam and in 5 of 1000 messages known not to be spam. Estimate the probability that an
                                                     incoming message containing the word “Rolex” is spam, assuming that it is equally likely that
                                                     an incoming message is spam or not spam. If our threshold for rejecting a message as spam
                                                     is 0.9, will we reject such messages?

                                                     Solution: We use the counts that the word “Rolex” appears in spam messages and messages that
                                                     are not spam to find that p(Rolex) = 250/2000 = 0.125 and q(Rolex) = 5/1000 = 0.005.
   489   490   491   492   493   494   495   496   497   498   499