Page 309 -
P. 309

2011/6/1
                               13-ch06-243-278-9780123814791
                         HAN
          272   Chapter 6 Mining Frequent Patterns, Associations, and Correlations  3:20 Page 272  #30



                           customers’ buying habits by searching for itemsets that are frequently purchased
                           together (or in sequence).
                           Association rule mining consists of first finding frequent itemsets (sets of items,
                           such as A and B, satisfying a minimum support threshold, or percentage of the task-
                           relevant tuples), from which strong association rules in the form of A ⇒ B are
                           generated. These rules also satisfy a minimum confidence threshold (a prespecified
                           probability of satisfying B under the condition that A is satisfied). Associations can be
                           further analyzed to uncover correlation rules, which convey statistical correlations
                           between itemsets A and B.
                           Many efficient and scalable algorithms have been developed for frequent itemset
                           mining, from which association and correlation rules can be derived. These algo-
                           rithms can be classified into three categories: (1) Apriori-like algorithms, (2) frequent
                           pattern growth–based algorithms such as FP-growth, and (3) algorithms that use the
                           vertical data format.
                           The Apriori algorithm is a seminal algorithm for mining frequent itemsets for
                           Boolean association rules. It explores the level-wise mining Apriori property that all
                           nonempty subsets of a frequent itemset must also be frequent. At the kth iteration (for
                           k ≥ 2), it forms frequent k-itemset candidates based on the frequent (k − 1)-itemsets,
                           and scans the database once to find the complete set of frequent k-itemsets, L k .
                              Variations involving hashing and transaction reduction can be used to make the
                           procedure more efficient. Other variations include partitioning the data (mining on
                           each partition and then combining the results) and sampling the data (mining on
                           a data subset). These variations can reduce the number of data scans required to as
                           little as two or even one.
                           Frequent pattern growth is a method of mining frequent itemsets without candidate
                           generation. It constructs a highly compact data structure (an FP-tree) to compress the
                           originaltransactiondatabase.Ratherthanemployingthegenerate-and-teststrategyof
                           Apriori-like methods, it focuses on frequent pattern (fragment) growth, which avoids
                           costly candidate generation, resulting in greater efficiency.
                           Mining frequent itemsets using the vertical data format (Eclat) is a method that
                           transforms a given data set of transactions in the horizontal data format of TID-
                           itemset into the vertical data format of item-TID set. It mines the transformed
                           data set by TID set intersections based on the Apriori property and additional
                           optimization techniques such as diffset.

                           Not all strong association rules are interesting. Therefore, the support–confidence
                           framework should be augmented with a pattern evaluation measure, which promotes
                           the mining of interesting rules. A measure is null-invariant if its value is free from
                           the influence of null-transactions (i.e., the transactions that do not contain any of
                           the itemsets being examined). Among many pattern evaluation measures, we exam-
                                    2
                           ined lift, χ , all confidence, max confidence, Kulczynski, and cosine, and showed
   304   305   306   307   308   309   310   311   312   313   314