          262   Chapter 6 Mining Frequent Patterns, Associations, and Correlations

                         The TID sets of the frequent k-itemsets are intersected to compute the TID sets of
                         the corresponding (k + 1)-itemsets. This process repeats, with k incremented by 1 each
                         time, until no more frequent itemsets or candidate itemsets can be found.
                           Besides taking advantage of the Apriori property in the generation of candidate
                         (k + 1)-itemsets from frequent k-itemsets, another merit of this method is that there
                         is no need to scan the database to find the support of (k + 1)-itemsets (for k ≥ 1).
                         This is because the TID set of each k-itemset carries the complete information required
                         for counting such support. However, the TID sets can be quite long, taking substantial
                         memory space as well as computation time for intersecting the long sets.
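The level-wise intersection of TID sets described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's algorithm in full: the TID sets are those of items I1, I2, and I4 from Example 6.6, and the names `tid_sets` and `next_level` are illustrative.

```python
from itertools import combinations

# Vertical data format: each itemset maps to the set of TIDs of the
# transactions containing it (items I1, I2, I4 from Example 6.6).
tid_sets = {
    frozenset({"I1"}): {"T100", "T400", "T500", "T700", "T800", "T900"},
    frozenset({"I2"}): {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
    frozenset({"I4"}): {"T200", "T400"},
}
min_sup = 2  # minimum support count

def next_level(freq, min_sup):
    """Join frequent k-itemsets pairwise; keep the (k+1)-itemsets whose
    intersected TID sets meet min_sup. No database scan is needed: the
    TID sets carry the complete support information."""
    result = {}
    for a, b in combinations(freq, 2):
        union = a | b
        if len(union) == len(a) + 1:      # joinable: the two differ in one item
            tids = freq[a] & freq[b]      # TID-set intersection
            if len(tids) >= min_sup:
                result[union] = tids
    return result

# Repeat with k incremented by 1 until no frequent itemsets remain.
all_frequent = dict(tid_sets)
level = tid_sets
while level:
    level = next_level(level, min_sup)
    all_frequent.update(level)

print({tuple(sorted(x)): len(t) for x, t in all_frequent.items()})
```

Note that the support of each candidate falls directly out of the intersected TID set's size, which is exactly why no further database scans are required.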
                           To further reduce the cost of registering long TID sets, as well as the subsequent
                         costs of intersections, we can use a technique called diffset, which keeps track of only
                         the difference between the TID set of a (k + 1)-itemset and that of a corresponding
                         k-itemset. For
                         instance, in Example 6.6 we have {I1} = {T100, T400, T500, T700, T800, T900} and {I1,
                         I2} = {T100, T400, T800, T900}. The diffset between the two is diffset({I1, I2}, {I1}) =
                         {T500, T700}. Thus, rather than recording the four TIDs that make up the intersection of
                         {I1} and {I2}, we can instead use diffset to record just two TIDs, indicating the difference
                         between {I1} and {I1, I2}. Experiments show that in certain situations, such as when the
                         data set contains many dense and long patterns, this technique can substantially reduce
                         the total cost of vertical format mining of frequent itemsets.
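The diffset computation above can be written out directly. This is a minimal sketch using the TID sets of {I1} and {I1, I2} from Example 6.6; the variable names are illustrative.

```python
# TID sets from Example 6.6.
tids_I1 = {"T100", "T400", "T500", "T700", "T800", "T900"}
tids_I1_I2 = {"T100", "T400", "T800", "T900"}

# diffset({I1, I2}, {I1}): TIDs containing I1 but not {I1, I2},
# i.e., only two TIDs instead of the four in the full TID set.
diffset = tids_I1 - tids_I1_I2
print(sorted(diffset))            # ['T500', 'T700']

# The support of {I1, I2} is recoverable without storing its TID set:
support_I1_I2 = len(tids_I1) - len(diffset)
print(support_I1_I2)              # 4
```

For dense data, diffsets tend to be much shorter than the TID sets themselves, which is the source of the savings the text describes.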





                   6.2.6 Mining Closed and Max Patterns
                         In Section 6.1.2 we saw how frequent itemset mining may generate a huge number of
                         frequent itemsets, especially when the min_sup threshold is set low or when there exist
                         long patterns in the data set. Example 6.2 showed that closed frequent itemsets⁹ can
                         substantially reduce the number of patterns generated in frequent itemset mining while
                         preserving the complete information regarding the set of frequent itemsets. That is, from
                         the set of closed frequent itemsets, we can easily derive the set of frequent itemsets and
                         their support. Thus, in practice it is often more desirable to mine the set of closed
                         frequent itemsets than the set of all frequent itemsets.
                           “How can we mine closed frequent itemsets?” A naïve approach would be to first mine
                         the complete set of frequent itemsets and then remove every frequent itemset that is a
                         proper subset of, and carries the same support as, an existing frequent itemset. However,
                         this is quite costly. As shown in Example 6.2, this method would have to first derive
                         2^100 − 1 frequent itemsets to obtain a length-100 frequent itemset, all before it could
                         begin to eliminate redundant itemsets. This is prohibitively expensive. In fact, there exist
                         only a very small number of closed frequent itemsets in Example 6.2’s data set.
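The naive post-processing approach can be sketched as follows. This is a minimal illustration of why it works (not of why it is costly): the support counts are taken from the chapter's AllElectronics transactions, and `closed_itemsets` is an illustrative helper name, not from the text.

```python
# All frequent itemsets with their support counts (AllElectronics data,
# min_sup = 2; only the itemsets involving I1, I2, I5 are shown here).
frequent = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}

def closed_itemsets(frequent):
    """Remove every itemset that is a proper subset of, and carries the
    same support as, another frequent itemset."""
    return {
        x: sup for x, sup in frequent.items()
        if not any(x < y and sup == frequent[y] for y in frequent)
    }

closed = closed_itemsets(frequent)
for itemset, sup in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), sup)
```

Here {I5}, {I1, I5}, and {I2, I5} are all absorbed by {I1, I2, I5}, which has the same support count of 2. The cost problem is that `frequent` must be fully materialized first, which is infeasible when it contains on the order of 2^100 itemsets.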
                           A recommended methodology is to search for closed frequent itemsets directly dur-
                         ing the mining process. This requires us to prune the search space as soon as we


                         9 Remember that X is a closed frequent itemset in a data set S if there exists no proper super-itemset Y
                         such that Y has the same support count as X in S, and X satisfies minimum support.