
Chapter 6 Mining Frequent Patterns, Associations, and Correlations



                           Lift is a simple correlation measure that is given as follows. The occurrence of itemset
                         A is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise,
                         itemsets A and B are dependent and correlated as events. This definition can easily be
                         extended to more than two itemsets. The lift between the occurrence of A and B can be
                         measured by computing


                         \[ \mathrm{lift}(A, B) = \frac{P(A \cup B)}{P(A)\,P(B)}. \tag{6.8} \]
                         If the resulting value of Eq. (6.8) is less than 1, then the occurrence of A is negatively
                         correlated with the occurrence of B, meaning that the occurrence of one makes the
                         occurrence of the other less likely. If the resulting value is greater than 1, then A and B are
                         positively correlated, meaning that the occurrence of one makes the occurrence of the
                         other more likely. If the resulting value is equal to 1, then A and B are independent and there is no
                         correlation between them.
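
The definition and its three-way interpretation can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `support`, `lift`, and `interpret`, the tolerance `tol`, and the four toy transactions are all assumptions made for this example. As in Eq. (6.8), P(A ∪ B) is read as the fraction of transactions that contain every item of both A and B.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def lift(transactions, a, b):
    """lift(A, B) = P(A ∪ B) / (P(A) P(B)), following Eq. (6.8)."""
    p_a = support(transactions, a)
    p_b = support(transactions, b)
    if p_a == 0 or p_b == 0:
        raise ValueError("lift is undefined when P(A) or P(B) is zero")
    # P(A ∪ B): transactions containing all items of A and all items of B.
    p_ab = support(transactions, set(a) | set(b))
    return p_ab / (p_a * p_b)


def interpret(value, tol=1e-9):
    """Map a lift value to the three cases described in the text."""
    if abs(value - 1.0) <= tol:
        return "independent"
    return "positively correlated" if value > 1.0 else "negatively correlated"


# Hypothetical toy data, invented purely for illustration.
toy = [{"bread", "milk"}, {"bread"}, {"milk"}, {"bread", "milk"}]
value = lift(toy, {"bread"}, {"milk"})
print(interpret(value))
```

Here P(bread) = P(milk) = 3/4 and P(bread, milk) = 2/4, so the lift is (2/4)/(9/16) = 8/9, which falls in the negatively correlated case.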
                           Equation (6.8) is equivalent to P(B|A)/P(B), or conf(A ⇒ B)/sup(B), which is also
                         referred to as the lift of the association (or correlation) rule A ⇒ B. In other words, it
                         assesses the degree to which the occurrence of one “lifts” the occurrence of the other. For
                         example, if A corresponds to the sale of computer games and B corresponds to the sale
                         of videos, then given the current market conditions, the sale of games is said to increase
                         or “lift” the likelihood of the sale of videos by a factor of the value returned by Eq. (6.8).
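
The equivalence stated above follows from the definition of conditional probability together with the definitions of confidence and support. Written out step by step, with P(A ∪ B) denoting, as in Eq. (6.8), the probability that a transaction contains both itemsets:

\[
\mathrm{lift}(A, B)
  = \frac{P(A \cup B)}{P(A)\,P(B)}
  = \frac{P(A \cup B)/P(A)}{P(B)}
  = \frac{P(B \mid A)}{P(B)}
  = \frac{\mathit{conf}(A \Rightarrow B)}{\mathit{sup}(B)}.
\]

Because the first expression is symmetric in A and B, the same value is obtained for the rule B ⇒ A.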
                           Let’s go back to the computer game and video data of Example 6.7.

            Example 6.8 Correlation analysis using lift. To help filter out misleading “strong” associations of
                         the form A ⇒ B from the data of Example 6.7, we need to study how the two itemsets,
                         A and B, are correlated. Let $\overline{game}$ refer to the transactions of Example 6.7 that do
                         not contain computer games, and $\overline{video}$ refer to those that do not contain videos. The
                         transactions can be summarized in a contingency table, as shown in Table 6.6.
                           From the table, we can see that the probability of purchasing a computer game
                         is P({game}) = 0.60, the probability of purchasing a video is P({video}) = 0.75, and
                         the probability of purchasing both is P({game,video}) = 0.40. By Eq. (6.8), the lift of
                         Rule (6.6) is P({game, video})/(P({game}) × P({video})) = 0.40/(0.60 × 0.75) = 0.89.
                         Because this value is less than 1, there is a negative correlation between the occur-
                         rence of {game} and {video}. The numerator is the likelihood of a customer purchasing
                         both, while the denominator is what the likelihood would have been if the two pur-
                         chases were completely independent. Such a negative correlation cannot be identified
                         by a support–confidence framework.
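
The arithmetic of this example can be checked with a short Python sketch. The three probabilities are the ones quoted above; the variable names are chosen here for illustration only. The sketch also computes the rule's confidence, to show why a support–confidence view alone can mislead: the confidence looks fairly high, yet it is lower than the unconditional probability of buying a video.

```python
# Probabilities quoted in Example 6.8.
p_game = 0.60   # P({game})
p_video = 0.75  # P({video})
p_both = 0.40   # P({game, video})

# Lift by Eq. (6.8).
lift_value = p_both / (p_game * p_video)

# Equivalent form: conf(game => video) / sup(video).
confidence = p_both / p_game
lift_from_conf = confidence / p_video

print(round(lift_value, 2))   # 0.89
print(round(confidence, 2))   # 0.67
print(confidence < p_video)   # True: knowing a game was bought lowers the chance of a video
```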

                           The second correlation measure that we study is the χ² measure, which was introduced
                         in Chapter 3 (Eq. 3.1). To compute the χ² value, we take the squared difference
                         between the observed and expected value for a slot (A and B pair) in the contingency
                         table, divided by the expected value. This amount is summed for all slots of the
                         contingency table. Let’s perform a χ² analysis of Example 6.8.
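
As a rough preview of that computation, the slot-by-slot sum can be sketched in Python. Several things here are assumptions rather than part of the passage: the function name `chi_square`, the row and column layout of the table, and above all the total of 10,000 transactions, which is not stated in this passage and is chosen only so that the probabilities of Example 6.8 become whole counts. The expected value of a slot is computed in the usual way, as (row total × column total) / grand total.

```python
def chi_square(observed):
    """Sum of (observed - expected)^2 / expected over all slots of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if row and column events were independent.
            expected = row_totals[i] * col_totals[j] / n
            total += (obs - expected) ** 2 / expected
    return total


# ASSUMPTION: 10,000 transactions in total (not given in this passage).
n = 10_000
# Slot shares in percent, derived from P(game) = 0.60, P(video) = 0.75, P(both) = 0.40.
both, video_only, game_only, neither = 40, 35, 20, 5

# Rows: video, no video. Columns: game, no game.
observed = [
    [both * n // 100, video_only * n // 100],
    [game_only * n // 100, neither * n // 100],
]
print(round(chi_square(observed), 1))
```

Under the assumed total, the observed count for the (game, video) slot is 4000 against an expected 4500. An observed count below the expected one is consistent with the negative correlation indicated by the lift value.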