Page 286 -
P. 286
3:20
#7
Page 249
2011/6/1
HAN 13-ch06-243-278-9780123814791
6.2 Frequent Itemset Mining Methods 249
collecting those items that satisfy minimum support. The resulting set is denoted by L 1 .
Next, L 1 is used to find L 2 , the set of frequent 2-itemsets, which is used to find L 3 , and
so on, until no more frequent k-itemsets can be found. The finding of each L k requires
one full scan of the database.
To improve the efficiency of the level-wise generation of frequent itemsets, an
important property called the Apriori property is used to reduce the search space.
Apriori property: All nonempty subsets of a frequent itemset must also be frequent.
The Apriori property is based on the following observation. By definition, if an item-
set I does not satisfy the minimum support threshold, min sup, then I is not frequent,
that is, P(I) < min sup. If an item A is added to the itemset I, then the resulting itemset
(i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either,
that is, P(I ∪ A) < min sup.
This property belongs to a special category of properties called antimonotonicity in
the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It
is called antimonotonicity because the property is monotonic in the context of failing a
test. 6
“How is the Apriori property used in the algorithm?” To understand this, let us look at
how L k−1 is used to find L k for k ≥ 2. A two-step process is followed, consisting of join
and prune actions.
1. The join step: To find L k , a set of candidate k-itemsets is generated by joining
L k−1 with itself. This set of candidates is denoted C k . Let l 1 and l 2 be itemsets
in L k−1 . The notation l i [j] refers to the jth item in l i (e.g., l 1 [k − 2] refers to
the second to the last item in l 1 ). For efficient implementation, Apriori assumes
that items within a transaction or itemset are sorted in lexicographic order. For
the (k − 1)-itemset, l i , this means that the items are sorted such that l i [1] < l i [2]
< ··· < l i [k − 1]. The join, L k−1 1 L k−1 , is performed, where members of L k−1 are
joinable if their first (k − 2) items are in common. That is, members l 1 and l 2
of L k−1 are joined if (l 1 [1] = l 2 [1]) ∧ (l 1 [2] = l 2 [2]) ∧ ··· ∧ (l 1 [k − 2] = l 2 [k − 2])
∧(l 1 [k − 1] < l 2 [k − 1]). The condition l 1 [k − 1] < l 2 [k − 1] simply ensures that
no duplicates are generated. The resulting itemset formed by joining l 1 and l 2 is
{l 1 [1], l 1 [2],..., l 1 [k − 2], l 1 [k − 1], l 2 [k − 1]}.
2. The prune step: C k is a superset of L k , that is, its members may or may not be
frequent, but all of the frequent k-itemsets are included in C k . A database scan to
determine the count of each candidate in C k would result in the determination of
L k (i.e., all candidates having a count no less than the minimum support count are
frequent by definition, and therefore belong to L k ). C k , however, can be huge, and so
this could involve heavy computation. To reduce the size of C k , the Apriori property
6 The Apriori property has many applications. For example, it can also be used to prune search during
data cube computation (Chapter 5).