
          Chapter 6 Mining Frequent Patterns, Associations, and Correlations



                           Similarly, in D3, the four new measures correctly show that m and c are strongly
                         negatively associated because the mc to c ratio equals the mc to m ratio, that is,
                         100/1100 = 9.1%. However, lift and χ² both contradict this in an incorrect way: Their
                         values for D2 are between those for D1 and D3.
                           For data set D4, both lift and χ² indicate a highly positive association between
                         m and c, whereas the others indicate a “neutral” association because the ratio of mc to
                         m̄c equals the ratio of mc to mc̄, which is 1. This means that if a customer buys
                         coffee (or milk), the probability that he or she will also purchase milk (or coffee) is
                         exactly 50%.
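The ratios quoted for D3 and D4 can be checked with a short Python sketch. It assumes the usual definitions of the four measures in terms of the conditional probabilities P(A|B) and P(B|A), which this page does not restate. The function name and the count arguments are hypothetical. The D3 counts follow from the 100/1100 ratio above; the absolute scale of the D4 counts is an assumption chosen only so that the mc to m̄c and mc to mc̄ ratios are both 1.

```python
from math import sqrt


def null_invariant_measures(both, a_only, b_only):
    """Return all_confidence, max_confidence, Kulczynski, and cosine.

    both   -- transactions containing A and B (mc)
    a_only -- transactions containing A but not B
    b_only -- transactions containing B but not A
    The count of null-transactions is deliberately not an argument.
    """
    sup_a = both + a_only
    sup_b = both + b_only
    p_b_given_a = both / sup_a
    p_a_given_b = both / sup_b
    return {
        "all_conf": min(p_a_given_b, p_b_given_a),
        "max_conf": max(p_a_given_b, p_b_given_a),
        "kulc": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": both / sqrt(sup_a * sup_b),
    }


# D3-like counts: mc = 100, sup(m) = sup(c) = 1100 (from the text).
d3 = null_invariant_measures(both=100, a_only=1000, b_only=1000)

# D4-like counts: mc equals each one-sided count (scale is assumed).
d4 = null_invariant_measures(both=1000, a_only=1000, b_only=1000)

for name, result in (("D3", d3), ("D4", d4)):
    print(name, {k: round(v, 3) for k, v in result.items()})
```

With these inputs all four measures coincide, at about 0.091 for the D3-like counts and at 0.5 for the D4-like counts, matching the 9.1% and 50% figures in the text.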

                           “Why are lift and χ² so poor at distinguishing pattern association relationships in
                         the previous transactional data sets?” To answer this, we have to consider the
                         null-transactions. A null-transaction is a transaction that does not contain any of the
                         itemsets being examined. In our example, m̄c̄ represents the number of null-transactions.
                         Lift and χ² have difficulty distinguishing interesting pattern association relationships
                         because they are both strongly influenced by m̄c̄. Typically, the number of
                         null-transactions can outweigh the number of individual purchases because, for example,
                         many people may buy neither milk nor coffee. On the other hand, the other four
                         measures are good indicators of interesting pattern associations because their
                         definitions remove the influence of m̄c̄ (i.e., they are not influenced by the number of
                         null-transactions).
                           This discussion shows that it is highly desirable to have a measure whose value
                         is independent of the number of null-transactions. A measure is null-invariant if
                         its value is free from the influence of null-transactions. Null-invariance is an
                         important property for measuring association patterns in large transaction databases.
                         Among the six measures discussed in this subsection, only lift and χ² are not
                         null-invariant measures.
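To make null-invariance concrete, the following Python sketch holds the three non-null cells of a 2 × 2 contingency table fixed and varies only the number of null-transactions. The counts are hypothetical, the function names are illustrative, and χ² is computed with the standard closed form for a 2 × 2 table rather than with any formula taken from this page.

```python
def lift(both, a_only, b_only, neither):
    """lift = P(A and B) / (P(A) * P(B)); depends on the total count n."""
    n = both + a_only + b_only + neither
    return both * n / ((both + a_only) * (both + b_only))


def chi_square(both, a_only, b_only, neither):
    """Pearson chi-square for a 2 x 2 table (closed form, no correction)."""
    n = both + a_only + b_only + neither
    num = n * (both * neither - a_only * b_only) ** 2
    den = ((both + a_only) * (b_only + neither)
           * (both + b_only) * (a_only + neither))
    return num / den


def kulczynski(both, a_only, b_only):
    """Average of P(B|A) and P(A|B); the null count never appears."""
    return 0.5 * (both / (both + a_only) + both / (both + b_only))


# Hypothetical counts: only the number of null-transactions changes.
both, a_only, b_only = 1000, 1000, 1000
for neither in (100, 10_000, 100_000):
    print(
        f"nulls={neither:>7}  "
        f"lift={lift(both, a_only, b_only, neither):.3f}  "
        f"chi2={chi_square(both, a_only, b_only, neither):.1f}  "
        f"kulc={kulczynski(both, a_only, b_only):.3f}"
    )
```

In this sketch lift moves from below 1 to well above 1 as the null count grows, and χ² changes by orders of magnitude, while Kulczynski stays at 0.5 because the null count is not one of its inputs.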
                           “Among the all confidence, max confidence, Kulczynski, and cosine measures, which
                         is best at indicating interesting pattern relationships?”
                           To answer this question, we introduce the imbalance ratio (IR), which assesses the
                         imbalance of two itemsets, A and B, in rule implications. It is defined as

                         \[
                           IR(A,B) = \frac{|sup(A) - sup(B)|}{sup(A) + sup(B) - sup(A \cup B)}, \qquad (6.13)
                         \]
                         where the numerator is the absolute value of the difference between the support of the
                         itemsets A and B, and the denominator is the number of transactions containing A or
                         B. If the two directional implications between A and B are the same, then IR(A,B) will
                         be zero. Otherwise, the larger the difference between the two, the larger the imbalance
                         ratio. This ratio is independent of the number of null-transactions and independent of
                         the total number of transactions.
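Equation (6.13) translates directly into code. The sketch below is a minimal Python version that takes the three support counts as arguments; the function name, the parameter names, and the example counts are hypothetical and serve only to illustrate the balanced and skewed cases.

```python
def imbalance_ratio(sup_a, sup_b, sup_ab):
    """Imbalance ratio IR(A, B) as in Eq. (6.13).

    sup_a  -- number of transactions containing A
    sup_b  -- number of transactions containing B
    sup_ab -- number of transactions containing both A and B
    """
    # Denominator: transactions containing A or B (inclusion-exclusion).
    either = sup_a + sup_b - sup_ab
    if either == 0:
        raise ValueError("no transaction contains A or B")
    return abs(sup_a - sup_b) / either


# Balanced case (hypothetical counts): equal supports give IR = 0.
balanced = imbalance_ratio(sup_a=2000, sup_b=2000, sup_ab=1000)

# Skewed case (hypothetical counts): A is ten times as frequent as B.
skewed = imbalance_ratio(sup_a=11_000, sup_b=1_100, sup_ab=1_000)

print(f"balanced IR = {balanced:.2f}, skewed IR = {skewed:.2f}")
```

Neither the null-transaction count nor the total number of transactions is an argument, which reflects the independence property stated above.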
                           Let’s continue examining the remaining data sets in Example 6.10.

           Example 6.11 Comparing null-invariant measures in pattern evaluation. Although the four
                         measures introduced in this section are null-invariant, they may present dramatically