Page 496 - Discrete Mathematics and Its Applications
P. 496

7.3 Bayes’ Theorem  475


                                                     event that the message contains word w i , assuming that the number of incoming spam messages
                                                     is approximately the same as the number of incoming messages that are not spam, and that the
                                                     events E i | S are independent, then by Bayes’theorem the probability that a message containing
                                                     all the words w 1 , w 2 ,..., w k is spam is

                                                              k                 k
                                                                                   p(E i | S)
                                                                                i=1
                                                        p(S |                                      .
                                                                        k               k
                                                                E i ) =
                                                             i=1        i=1  p(E i | S) +  i=1  p(E i | S)
                                                     We can estimate this probability by
                                                                                   k
                                                                                   i=1  p(w i )
                                                                             k           k       .
                                                        r(w 1 , w 2 ,..., w k ) =
                                                                             i=1  p(w i ) +  i=1  q(w i )
                                                        For the most effective spam filter, we choose words for which the probability that each
                                                     of these words appears in spam is either very high or very low. When we compute this value
                                                     for a particular message, we reject the message as spam if r(w 1 , w 2 ,..., w k ) exceeds a preset
                                                     threshold, such as 0.9.
                                 Bayesian poisoning, the
                                 insertion of extra words to  Another way to improve the performance of a Bayesian spam filter is to look at the prob-
                                 defeat spam filters, can  abilities that particular pairs of words appear in spam and in messages that are not spam. We
                                 use random or       then treat appearances of these pairs of words as appearance of a single block, rather than as
                                 purposefully selected  the appearance of two separate words. For example, the pair of words “enhance performance”
                                 words.
                                                     most likely indicates spam, while “operatic performance” indicates a message that is not spam.
                                                     Similarly, we can assess the likelihood that a message is spam by examining the structure of a
                                                     message to determine where words appear in it.Also, spam filters look at appearances of certain
                                                     types of strings of characters rather than just words. For example, a message with the valid
                                                     e-mail address of one of your friends is less likely to be spam (if not sent by a worm) than one
                                                     containing an e-mail address that came from a country known to originate a lot of spam. There
                                                     is an ongoing war between people who create spam and those trying to filter their messages out.
                                                     This leads to the introduction of many new techniques to defeat spam filters, including inserting
                                                     into spam messages long strings of words that appear in messages that are not spam, as well as
                                                     including words inside pictures. The techniques we have discussed here are only the first steps
                                                     in fighting this war on spam.

                                 Exercises


                                   1. Suppose that E and F are events in a sample space and  six black balls. What is the probability that Ann picked
                                     p(E) = 1/3, p(F) = 1/2, and p(E | F) = 2/5. Find    a ball from the second box if she has selected an orange
                                     p(F | E).                                           ball?
                                   2. Suppose that E and F are events in a sample space and  5. Suppose that 8% of all bicycle racers use steroids, that
                                     p(E) = 2/3, p(F) = 3/4, and p(F | E) = 5/8. Find    a bicyclist who uses steroids tests positive for steroids
                                     p(E | F).                                           96% of the time, and that a bicyclist who does not use
                                                                                         steroids tests positive for steroids 9% of the time. What
                                   3. Suppose that Frida selects a ball by first picking one of
                                     two boxes at random and then selecting a ball from this  is the probability that a randomly selected bicyclist who
                                     box at random. The first box contains two white balls and  tests positive for steroids actually uses steroids?
                                     three blue balls, and the second box contains four white  6. When a test for steroids is given to soccer players, 98%
                                     balls and one blue ball. What is the probability that Frida  of the players taking steroids test positive and 12% of the
                                     picked a ball from the first box if she has selected a blue  players not taking steroids test positive. Suppose that 5%
                                     ball?                                               of soccer players take steroids. What is the probability
                                                                                         that a soccer player who tests positive takes steroids?
                                   4. Suppose thatAnn selects a ball by first picking one of two
                                     boxes at random and then selecting a ball from this box.  7. Suppose that a test for opium use has a 2% false positive
                                     The first box contains three orange balls and four black  rate and a 5% false negative rate. That is, 2% of peo-
                                     balls, and the second box contains five orange balls and  ple who do not use opium test positive for opium, and
   491   492   493   494   495   496   497   498   499   500   501