                                         6.3 Which Patterns Are Interesting?—Pattern Evaluation Methods  269


                     Table 6.8 2 × 2 Contingency Table for Two Items

                                          milk     ¬milk    Σ row
                               coffee     mc       ¬mc      c
                               ¬coffee    m¬c      ¬m¬c     ¬c
                               Σ col      m        ¬m       Σ




                     Table 6.9 Comparison of Six Pattern Evaluation Measures Using Contingency Tables
                               for a Variety of Data Sets
                               Data Set  mc      ¬mc   m¬c      ¬m¬c     χ²     lift   all conf.  max conf.  Kulc.  cosine
                               D1        10,000  1000  1000     100,000  90557  9.26   0.91       0.91       0.91   0.91
                               D2        10,000  1000  1000     100      0      1      0.91       0.91       0.91   0.91
                               D3        100     1000  1000     100,000  670    8.44   0.09       0.09       0.09   0.09
                               D4        1000    1000  1000     100,000  24740  25.75  0.5        0.5        0.5    0.5
                               D5        1000    100   10,000   100,000  8173   9.18   0.09       0.91       0.5    0.29
                               D6        1000    10    100,000  100,000  965    1.97   0.01       0.99       0.5    0.10



                 Example 6.10 Comparison of six pattern evaluation measures on typical data sets. The relationships
                               between the purchases of two items, milk and coffee, can be examined by summarizing
                               their purchase history in Table 6.8, a 2 × 2 contingency table, where an entry such as mc
                               represents the number of transactions containing both milk and coffee.
                                 Table 6.9 shows a set of transactional data sets with their corresponding
                               contingency tables and the associated values for each of the six evaluation
                               measures. Let's first examine the first four data sets, D1 through D4. From the
                               table, we see that m and c are positively associated in D1 and D2, negatively
                               associated in D3, and neutral in D4. For D1 and D2, m and c are positively
                               associated because mc (10,000) is considerably greater than ¬mc (1000) and
                               m¬c (1000), where ¬ marks an item absent from the transaction. Intuitively, for
                               people who bought milk (m = 10,000 + 1000 = 11,000), it is very likely that they
                               also bought coffee (mc/m = 10/11 = 91%), and vice versa.
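
The 10/11 = 91% figure, and the last four columns of Table 6.9, can be reproduced from the contingency counts alone. The following is a minimal sketch, assuming the usual definitions of the four measures in terms of the conditional probabilities P(c|m) and P(m|c): all confidence is their minimum, max confidence their maximum, Kulczynski their arithmetic mean, and cosine their geometric mean. The function and parameter names are hypothetical, chosen only for this illustration.

```python
from math import sqrt


def null_invariant_measures(mc, not_m_c, m_not_c):
    """Return (all_conf, max_conf, kulc, cosine) from contingency counts.

    mc       -- transactions containing both milk and coffee
    not_m_c  -- transactions containing coffee but not milk
    m_not_c  -- transactions containing milk but not coffee

    The count of transactions containing neither item is deliberately not
    a parameter: none of these four measures uses it.
    """
    sup_m = mc + m_not_c          # transactions containing milk
    sup_c = mc + not_m_c          # transactions containing coffee
    p_c_given_m = mc / sup_m      # P(coffee | milk)
    p_m_given_c = mc / sup_c      # P(milk | coffee)

    all_conf = min(p_c_given_m, p_m_given_c)
    max_conf = max(p_c_given_m, p_m_given_c)
    kulc = (p_c_given_m + p_m_given_c) / 2
    cosine = sqrt(p_c_given_m * p_m_given_c)
    return all_conf, max_conf, kulc, cosine


# D1 and D2 share these three counts, so they share all four values.
print([round(v, 2) for v in null_invariant_measures(10_000, 1000, 1000)])
# [0.91, 0.91, 0.91, 0.91]

# D5, where the two conditional probabilities differ sharply.
print([round(v, 2) for v in null_invariant_measures(1000, 100, 10_000)])
# [0.09, 0.91, 0.5, 0.29]
```

Leaving the "neither item" count out of the signature makes the point structurally: whatever that count is, the four values cannot change.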
                                 The four newly introduced measures each produce a value of 0.91 for both data
                               sets, showing that m and c are strongly positively associated. However, lift and
                               χ² generate dramatically different measure values for D1 and D2 due to their
                               sensitivity to ¬m¬c, the number of transactions containing neither milk nor
                               coffee. In fact, in many real-world scenarios, ¬m¬c is usually huge and unstable.
                               For example, in a market basket database, the total number of transactions could
                               fluctuate on a daily basis and overwhelmingly exceed the number of transactions
                               containing any particular itemset. Therefore, a good interestingness measure
                               should not be affected by transactions that do not contain the itemsets of
                               interest; otherwise, it would generate unstable results, as illustrated in D1
                               and D2.
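
The instability can be checked directly by holding the three informative counts fixed and changing only the number of transactions that contain neither item. The sketch below assumes the standard definition of lift, P(m and c) / (P(m) P(c)), and the standard shortcut form of Pearson's chi-square for a 2 × 2 table without continuity correction; neither formula is spelled out in this passage, and the function and parameter names are hypothetical.

```python
def lift_and_chi2(mc, not_m_c, m_not_c, neither):
    """Return (lift, chi_square) for a 2 x 2 contingency table.

    `neither` is the count of transactions containing neither milk nor coffee.
    """
    n = mc + not_m_c + m_not_c + neither   # total number of transactions
    sup_m = mc + m_not_c                   # transactions containing milk
    sup_c = mc + not_m_c                   # transactions containing coffee

    # lift = P(m and c) / (P(m) * P(c)), written with counts
    lift = mc * n / (sup_m * sup_c)

    # Shortcut chi-square for a 2 x 2 table:
    # n * (ad - bc)^2 / (product of the four marginal totals)
    diff = mc * neither - not_m_c * m_not_c
    chi2 = n * diff ** 2 / (sup_m * (n - sup_m) * sup_c * (n - sup_c))
    return lift, chi2


# Same mc, same single-item counts; only the "neither" count differs.
for name, neither in [("D1", 100_000), ("D2", 100)]:
    lift, chi2 = lift_and_chi2(10_000, 1000, 1000, neither)
    print(name, round(lift, 2), round(chi2))
# D1 9.26 90557
# D2 1.0 0
```

With identical purchase behavior among the transactions that involve milk or coffee, lift moves from 9.26 to 1 and χ² from 90557 to 0, which matches the D1 and D2 rows of Table 6.9.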