Page 495 - Discrete Mathematics and Its Applications
P. 495

474  7 / Discrete Probability


                                                Because we are assuming that it is equally likely for an incoming message to be spam as it is
                                                not to be spam, we can estimate the probability that an incoming message containing the word
                                                “Rolex” is spam by
                                                                   p(Rolex)            0.125       0.125
                                                    r(Rolex) =                   =               =       ≈ 0.962.
                                                              p(Rolex) + q(Rolex)   0.125 + 0.005  0.130

                                                Because r(Rolex) is greater than the threshold 0.9, we reject such messages as spam.  ▲

                                                    Detecting spam based on the presence of a single word can lead to excessive false positives
                                                and false negatives. Consequently, spam filters look at the presence of multiple words. For
                                                example, suppose that the message contains the words w 1 and w 2 . Let E 1 and E 2 denote the
                                                events that the message contains the words w 1 and w 2 , respectively. To make our computations
                                                simpler, we assume that E 1 and E 2 are independent events and that E 1 | S and E 2 | S are
                                                independent events and that we have no prior knowledge regarding whether or not the message
                                                is spam. (The assumptions that E 1 and E 2 are independent and that E 1 | S and E 2 | S are
                                                independent may introduce some error into our computations; we assume that this error is small.)
                                                Using Bayes’theorem and our assumptions, we can show (see Exercise 23) that p(S | E 1 ∩ E 2 ),
                                                the probability that the message is spam given that it contains both w 1 and w 2 ,is

                                                                             p(E 1 | S)p(E 2 | S)
                                                    p(S | E 1 ∩ E 2 ) =                                 .
                                                                    p(E 1 | S)p(E 2 | S) + p(E 1 | S)p(E 2 | S)
                                                    We estimate the probability p(S | E 1 ∩ E 2 ) by

                                                                     p(w 1 )p(w 2 )
                                                    r(w 1 , w 2 ) =                   .
                                                               p(w 1 )p(w 2 ) + q(w 1 )q(w 2 )
                                                That is, r(w 1 , w 2 ) estimates the probability that the message is spam, given that it contains the
                                                words w 1 and w 2 . When r(w 1 , w 2 ) is greater than a preset threshold, such as 0.9, we determine
                                                that the message is likely spam.
                                 EXAMPLE 4      Suppose that we train a Bayesian spam filter on a set of 2000 spam messages and 1000 messages
                                                that are not spam. The word “stock” appears in 400 spam messages and 60 messages that are not
                                                spam, and the word “undervalued” appears in 200 spam messages and 25 messages that are not
                                                spam. Estimate the probability that an incoming message containing both the words “stock” and
                                                “undervalued” is spam, assuming that we have no prior knowledge about whether it is spam.
                                                Will we reject such messages as spam when we set the threshold at 0.9?

                                                Solution: Using the counts of each of these two words in messages known to be spam
                                                or known not to be spam, we obtain the following estimates: p(stock) = 400/2000 = 0.2,
                                                q(stock) = 60/1000 = 0.06, p(undervalued) = 200/2000 = 0.1, and q(undervalued) =
                                                25/1000 = 0.025. Using these probabilities, we can estimate the probability that the message
                                                is spam by

                                                                                     p(stock)p(undervalued)
                                                    r(stock, undervalued) =
                                                                         p(stock)p(undervalued) + q(stock)q(undervalued)
                                                                                (0.2)(0.1)
                                                                       =                         ≈ 0.930.
                                                                         (0.2)(0.1) + (0.06)(0.025)
                                                Because we have set the threshold for rejecting messages at 0.9, such messages will be rejected
                                                by the filter.                                                                  ▲
                                                    The more words we use to estimate the probability that an incoming mail message is spam,
                                                the better is our chance that we correctly determine whether it is spam. In general, if E i is the
   490   491   492   493   494   495   496   497   498   499   500