

different values on some subtly different data sets. Let’s examine data sets D5 and D6, shown earlier in Table 6.9, where the two events m and c have unbalanced conditional probabilities. That is, the ratio of mc to c is greater than 0.9, so knowing that c occurs should strongly suggest that m occurs also. The ratio of mc to m is less than 0.1, indicating that when m occurs, c is quite unlikely to occur. The all_confidence and cosine measures view both cases as negatively associated, and the Kulc measure views both as neutral. The max_confidence measure claims strong positive associations for these cases. The measures give very diverse results!
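The disagreement among the four measures can be checked numerically. The following is a minimal sketch, assuming the usual definitions of the four null-invariant measures in terms of the two conditional probabilities P(m|c) and P(c|m); the function and variable names are illustrative, and the counts for D5 and D6 are the ones quoted in this section.

```python
from math import sqrt

def measures(mc, not_m_c, m_not_c):
    """Four null-invariant measures from contingency counts.

    mc       -- transactions containing both milk and coffee
    not_m_c  -- transactions containing coffee but not milk
    m_not_c  -- transactions containing milk but not coffee
    (Argument names are illustrative, not taken from the text.)
    """
    sup_m = mc + m_not_c          # support count of milk
    sup_c = mc + not_m_c          # support count of coffee
    p_m_given_c = mc / sup_c      # ratio of mc to c
    p_c_given_m = mc / sup_m      # ratio of mc to m
    return {
        "all_confidence": min(p_m_given_c, p_c_given_m),
        "max_confidence": max(p_m_given_c, p_c_given_m),
        "kulc": 0.5 * (p_m_given_c + p_c_given_m),
        "cosine": mc / sqrt(sup_m * sup_c),
    }

# Counts (mc, not-m and c, m and not-c) as quoted in this section.
DATASETS = {"D5": (1000, 100, 10_000), "D6": (1000, 10, 100_000)}

for name, counts in DATASETS.items():
    rounded = {k: round(v, 2) for k, v in measures(*counts).items()}
    print(name, rounded)
```

For both data sets, all_confidence and cosine fall well below 0.5, Kulc sits at 0.5, and max_confidence is above 0.9, which is the spread of verdicts described above.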
“Which measure intuitively reflects the true relationship between the purchase of milk and coffee?” Due to the “balanced” skewness of the data, it is difficult to argue whether the two data sets have positive or negative association. From one point of view, only $mc/(mc + m\overline{c})$ = 1000/(1000 + 10,000) = 9.09% of milk-related transactions contain coffee in D5, and this percentage is 1000/(1000 + 100,000) = 0.99% in D6, both indicating a negative association. On the other hand, 90.9% of the transactions containing coffee in D5 (i.e., $mc/(mc + \overline{m}c)$ = 1000/(1000 + 100)) and 99.0% in D6 (i.e., 1000/(1000 + 10)) contain milk as well, which indicates a positive association between milk and coffee. The two viewpoints lead to very different conclusions.
For such “balanced” skewness, it could be fair to treat it as neutral, as Kulc does, and at the same time indicate its skewness using the imbalance ratio (IR). According to Eq. (6.13), for D4 we have IR(m,c) = 0, a perfectly balanced case; for D5, IR(m,c) = 0.89, a rather imbalanced case; whereas for D6, IR(m,c) = 0.99, a very skewed case. Therefore, the two measures, Kulc and IR, work together, presenting a clear picture for all three data sets, D4 through D6.
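The IR values quoted above can be reproduced with a short calculation. This is a sketch only, assuming Eq. (6.13) has the form IR(A,B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B)). The function name is illustrative, and the D4 counts are an assumption (Table 6.9 is not reproduced here); any counts in which milk and coffee have equal support give IR = 0.

```python
def kulc_and_ir(mc, not_m_c, m_not_c):
    """Return (Kulc, IR) for two items from their contingency counts.

    Assumes IR = |sup(m) - sup(c)| / (sup(m) + sup(c) - sup(mc)).
    """
    sup_m = mc + m_not_c
    sup_c = mc + not_m_c
    kulc = 0.5 * (mc / sup_m + mc / sup_c)
    ir = abs(sup_m - sup_c) / (sup_m + sup_c - mc)
    return kulc, ir

# (mc, not-m and c, m and not-c); the D4 row is assumed, see the lead-in.
DATASETS = {
    "D4": (1000, 1000, 1000),
    "D5": (1000, 100, 10_000),
    "D6": (1000, 10, 100_000),
}

for name, counts in DATASETS.items():
    kulc, ir = kulc_and_ir(*counts)
    print(f"{name}: Kulc = {kulc:.2f}, IR = {ir:.2f}")
```

Kulc alone cannot tell the three data sets apart, since it is 0.5 in each; reporting IR next to it shows how unevenly the two items are distributed.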

In summary, the use of only support and confidence measures to mine associations may generate a large number of rules, many of which can be uninteresting to users. Instead, we can augment the support–confidence framework with a pattern interestingness measure, which helps focus the mining toward rules with strong pattern relationships. The added measure substantially reduces the number of rules generated and leads to the discovery of more meaningful rules. Besides those introduced in this section, many other interestingness measures have been studied in the literature. Unfortunately, most of them do not have the null-invariance property. Because large data sets typically have many null-transactions, it is important to consider the null-invariance property when selecting appropriate interestingness measures for pattern evaluation. Among the four null-invariant measures studied here, namely all_confidence, max_confidence, Kulc, and cosine, we recommend using Kulc in conjunction with the imbalance ratio.
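To see why null-invariance matters, the sketch below varies only the number of null-transactions (those containing neither item) while keeping the other counts of D5 fixed. Lift is used here purely as a familiar example of a measure that is not null-invariant, assuming its usual definition P(m ∪ c)/(P(m)P(c)); the function names and the two null counts are illustrative choices.

```python
def kulc(mc, not_m_c, m_not_c, null=0):
    """Kulc; the null-transaction count is accepted but never used."""
    sup_m = mc + m_not_c
    sup_c = mc + not_m_c
    return 0.5 * (mc / sup_m + mc / sup_c)

def lift(mc, not_m_c, m_not_c, null):
    """Lift depends on the total number of transactions, nulls included."""
    n = mc + not_m_c + m_not_c + null
    sup_m = mc + m_not_c
    sup_c = mc + not_m_c
    return (mc / n) / ((sup_m / n) * (sup_c / n))

# D5 counts from this section, with two hypothetical null-transaction counts.
for null in (100, 100_000):
    print(f"null = {null}: "
          f"Kulc = {kulc(1000, 100, 10_000, null):.2f}, "
          f"lift = {lift(1000, 100, 10_000, null):.2f}")
```

With few null-transactions the lift is below 1, and with many it is well above 1, so its verdict on the same pair of items flips from negative to positive; Kulc is unaffected.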


                       6.4     Summary



                                 The discovery of frequent patterns, associations, and correlation relationships among
                                 huge amounts of data is useful in selective marketing, decision analysis, and business
                                 management. A popular area of application is market basket analysis, which studies