Page 284 -
P. 284
#5
Page 247
3:20
2011/6/1
HAN 13-ch06-243-278-9780123814791
6.1 Basic Concepts 247
Equation (6.4) shows that the confidence of rule A ⇒ B can be easily derived from the
support counts of A and A ∪ B. That is, once the support counts of A, B, and A ∪ B are
found, it is straightforward to derive the corresponding association rules A ⇒ B and
B ⇒ A and check whether they are strong. Thus, the problem of mining association
rules can be reduced to that of mining frequent itemsets.
In general, association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as
frequently as a predetermined minimum support count, min sup.
2. Generate strong association rules from the frequent itemsets: By definition, these
rules must satisfy minimum support and minimum confidence.
Additional interestingness measures can be applied for the discovery of correlation
relationships between associated items, as will be discussed in Section 6.3. Because
the second step is much less costly than the first, the overall performance of mining
association rules is determined by the first step.
A major challenge in mining frequent itemsets from a large data set is the fact that
such mining often generates a huge number of itemsets satisfying the minimum support
(min sup) threshold, especially when min sup is set low. This is because if an itemset is
frequent, each of its subsets is frequent as well. A long itemset will contain a combinato-
rial number of shorter, frequent sub-itemsets. For example, a frequent itemset of length
100
100, such as {a 1 , a 2 ,..., a 100 }, contains = 100 frequent 1-itemsets: {a 1 }, {a 2 }, ...,
1
100
{a 100 }; frequent 2-itemsets: {a 1 , a 2 }, {a 1 , a 3 }, ..., {a 99 , a 100 }; and so on. The total
2
number of frequent itemsets that it contains is thus
100 100 100 100 30
+ + ··· + = 2 − 1 ≈ 1.27 × 10 . (6.5)
1 2 100
This is too huge a number of itemsets for any computer to compute or store. To over-
come this difficulty, we introduce the concepts of closed frequent itemset and maximal
frequent itemset.
5
An itemset X is closed in a data set D if there exists no proper super-itemset Y such
that Y has the same support count as X in D. An itemset X is a closed frequent itemset in
set D if X is both closed and frequent in D. An itemset X is a maximal frequent itemset
(or max-itemset) in a data set D if X is frequent, and there exists no super-itemset Y
such that X ⊂ Y and Y is frequent in D.
Let C be the set of closed frequent itemsets for a data set D satisfying a minimum sup-
port threshold, min sup. Let M be the set of maximal frequent itemsets for D satisfying
min sup. Suppose that we have the support count of each itemset in C and M. Notice
that C and its count information can be used to derive the whole set of frequent itemsets.
5 Y is a proper super-itemset of X if X is a proper sub-itemset of Y, that is, if X ⊂ Y. In other words,
every item of X is contained in Y but there is at least one item of Y that is not in X.