Page 496 - Discrete Mathematics and Its Applications
P. 496
7.3 Bayes’ Theorem 475
event that the message contains word w i , assuming that the number of incoming spam messages
is approximately the same as the number of incoming messages that are not spam, and that the
events E i | S are independent, then by Bayes’theorem the probability that a message containing
all the words w 1 , w 2 ,..., w k is spam is
k k
p(E i | S)
i=1
p(S | .
k k
E i ) =
i=1 i=1 p(E i | S) + i=1 p(E i | S)
We can estimate this probability by
k
i=1 p(w i )
k k .
r(w 1 , w 2 ,..., w k ) =
i=1 p(w i ) + i=1 q(w i )
For the most effective spam filter, we choose words for which the probability that each
of these words appears in spam is either very high or very low. When we compute this value
for a particular message, we reject the message as spam if r(w 1 , w 2 ,..., w k ) exceeds a preset
threshold, such as 0.9.
Bayesian poisoning, the
insertion of extra words to Another way to improve the performance of a Bayesian spam filter is to look at the prob-
defeat spam filters, can abilities that particular pairs of words appear in spam and in messages that are not spam. We
use random or then treat appearances of these pairs of words as appearance of a single block, rather than as
purposefully selected the appearance of two separate words. For example, the pair of words “enhance performance”
words.
most likely indicates spam, while “operatic performance” indicates a message that is not spam.
Similarly, we can assess the likelihood that a message is spam by examining the structure of a
message to determine where words appear in it.Also, spam filters look at appearances of certain
types of strings of characters rather than just words. For example, a message with the valid
e-mail address of one of your friends is less likely to be spam (if not sent by a worm) than one
containing an e-mail address that came from a country known to originate a lot of spam. There
is an ongoing war between people who create spam and those trying to filter their messages out.
This leads to the introduction of many new techniques to defeat spam filters, including inserting
into spam messages long strings of words that appear in messages that are not spam, as well as
including words inside pictures. The techniques we have discussed here are only the first steps
in fighting this war on spam.
Exercises
1. Suppose that E and F are events in a sample space and six black balls. What is the probability that Ann picked
p(E) = 1/3, p(F) = 1/2, and p(E | F) = 2/5. Find a ball from the second box if she has selected an orange
p(F | E). ball?
2. Suppose that E and F are events in a sample space and 5. Suppose that 8% of all bicycle racers use steroids, that
p(E) = 2/3, p(F) = 3/4, and p(F | E) = 5/8. Find a bicyclist who uses steroids tests positive for steroids
p(E | F). 96% of the time, and that a bicyclist who does not use
steroids tests positive for steroids 9% of the time. What
3. Suppose that Frida selects a ball by first picking one of
two boxes at random and then selecting a ball from this is the probability that a randomly selected bicyclist who
box at random. The first box contains two white balls and tests positive for steroids actually uses steroids?
three blue balls, and the second box contains four white 6. When a test for steroids is given to soccer players, 98%
balls and one blue ball. What is the probability that Frida of the players taking steroids test positive and 12% of the
picked a ball from the first box if she has selected a blue players not taking steroids test positive. Suppose that 5%
ball? of soccer players take steroids. What is the probability
that a soccer player who tests positive takes steroids?
4. Suppose thatAnn selects a ball by first picking one of two
boxes at random and then selecting a ball from this box. 7. Suppose that a test for opium use has a 2% false positive
The first box contains three orange balls and four black rate and a 5% false negative rate. That is, 2% of peo-
balls, and the second box contains five orange balls and ple who do not use opium test positive for opium, and

