Page 286 -
P. 286

3:20
                                                                                    #7
                                                                           Page 249
                                                             2011/6/1
                          HAN 13-ch06-243-278-9780123814791
                                                               6.2 Frequent Itemset Mining Methods  249


                               collecting those items that satisfy minimum support. The resulting set is denoted by L 1 .
                               Next, L 1 is used to find L 2 , the set of frequent 2-itemsets, which is used to find L 3 , and
                               so on, until no more frequent k-itemsets can be found. The finding of each L k requires
                               one full scan of the database.
                                 To improve the efficiency of the level-wise generation of frequent itemsets, an
                               important property called the Apriori property is used to reduce the search space.

                               Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
                                 The Apriori property is based on the following observation. By definition, if an item-
                               set I does not satisfy the minimum support threshold, min sup, then I is not frequent,
                               that is, P(I) < min sup. If an item A is added to the itemset I, then the resulting itemset
                               (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either,
                               that is, P(I ∪ A) < min sup.
                                 This property belongs to a special category of properties called antimonotonicity in
                               the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It
                               is called antimonotonicity because the property is monotonic in the context of failing a
                               test. 6
                                 “How is the Apriori property used in the algorithm?” To understand this, let us look at
                               how L k−1 is used to find L k for k ≥ 2. A two-step process is followed, consisting of join
                               and prune actions.
                               1. The join step: To find L k , a set of candidate k-itemsets is generated by joining
                                 L k−1 with itself. This set of candidates is denoted C k . Let l 1 and l 2 be itemsets
                                 in L k−1 . The notation l i [j] refers to the jth item in l i (e.g., l 1 [k − 2] refers to
                                 the second to the last item in l 1 ). For efficient implementation, Apriori assumes
                                 that items within a transaction or itemset are sorted in lexicographic order. For
                                 the (k − 1)-itemset, l i , this means that the items are sorted such that l i [1] < l i [2]
                                 < ··· < l i [k − 1]. The join, L k−1 1 L k−1 , is performed, where members of L k−1 are
                                 joinable if their first (k − 2) items are in common. That is, members l 1 and l 2
                                 of L k−1 are joined if (l 1 [1] = l 2 [1]) ∧ (l 1 [2] = l 2 [2]) ∧ ··· ∧ (l 1 [k − 2] = l 2 [k − 2])
                                 ∧(l 1 [k − 1] < l 2 [k − 1]). The condition l 1 [k − 1] < l 2 [k − 1] simply ensures that
                                 no duplicates are generated. The resulting itemset formed by joining l 1 and l 2 is
                                 {l 1 [1], l 1 [2],..., l 1 [k − 2], l 1 [k − 1], l 2 [k − 1]}.
                               2. The prune step: C k is a superset of L k , that is, its members may or may not be
                                 frequent, but all of the frequent k-itemsets are included in C k . A database scan to
                                 determine the count of each candidate in C k would result in the determination of
                                 L k (i.e., all candidates having a count no less than the minimum support count are
                                 frequent by definition, and therefore belong to L k ). C k , however, can be huge, and so
                                 this could involve heavy computation. To reduce the size of C k , the Apriori property


                               6 The Apriori property has many applications. For example, it can also be used to prune search during
                               data cube computation (Chapter 5).
   281   282   283   284   285   286   287   288   289   290   291