          262   Chapter 6 Mining Frequent Patterns, Associations, and Correlations

                         The TID sets of the frequent k-itemsets are intersected to compute the TID sets of
                         the corresponding (k + 1)-itemsets. This process repeats, with k incremented by 1 each
                         time, until no more frequent itemsets or candidate itemsets can be found.
                           Besides taking advantage of the Apriori property in the generation of candidate
                         (k + 1)-itemsets from frequent k-itemsets, another merit of this method is that there
                         is no need to scan the database to find the support of (k + 1)-itemsets (for k ≥ 1).
                         This is because the TID set of each k-itemset carries the complete information required
                         for counting such support. However, the TID sets can be quite long, taking substantial
                         memory space as well as computation time for intersecting the long sets.
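The level-wise intersection of TID sets described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's algorithm in full: the TID sets are those of items I1, I2, and I4 from Example 6.6, and the names `tid_sets` and `next_level` are illustrative.

```python
from itertools import combinations

# Vertical data format: each itemset maps to the set of TIDs of the
# transactions containing it (items I1, I2, I4 from Example 6.6).
tid_sets = {
    frozenset({"I1"}): {"T100", "T400", "T500", "T700", "T800", "T900"},
    frozenset({"I2"}): {"T100", "T200", "T300", "T400", "T600", "T800", "T900"},
    frozenset({"I4"}): {"T200", "T400"},
}
min_sup = 2  # minimum support count

def next_level(freq, min_sup):
    """Join frequent k-itemsets pairwise; keep the (k+1)-itemsets whose
    intersected TID sets meet min_sup. No database scan is needed: the
    TID sets carry the complete support information."""
    result = {}
    for a, b in combinations(freq, 2):
        union = a | b
        if len(union) == len(a) + 1:      # joinable: the two differ in one item
            tids = freq[a] & freq[b]      # TID-set intersection
            if len(tids) >= min_sup:
                result[union] = tids
    return result

# Repeat with k incremented by 1 until no frequent itemsets remain.
all_frequent = dict(tid_sets)
level = tid_sets
while level:
    level = next_level(level, min_sup)
    all_frequent.update(level)

print({tuple(sorted(x)): len(t) for x, t in all_frequent.items()})
```

Note that the support of each candidate falls directly out of the intersected TID set's size, which is exactly why no further database scans are required.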
                           To further reduce the cost of registering long TID sets, as well as the subsequent
                         costs of intersections, we can use a technique called diffset, which keeps track of only
                         the difference between the TID set of a (k + 1)-itemset and that of a corresponding
                         k-itemset. For
                         instance, in Example 6.6 we have {I1} = {T100, T400, T500, T700, T800, T900} and {I1,
                         I2} = {T100, T400, T800, T900}. The diffset between the two is diffset({I1, I2}, {I1}) =
                         {T500, T700}. Thus, rather than recording the four TIDs that make up the intersection of
                         {I1} and {I2}, we can instead use diffset to record just two TIDs, indicating the difference
                         between {I1} and {I1, I2}. Experiments show that in certain situations, such as when the
                         data set contains many dense and long patterns, this technique can substantially reduce
                         the total cost of vertical format mining of frequent itemsets.
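The diffset computation above can be written out directly. This is a minimal sketch using the TID sets of {I1} and {I1, I2} from Example 6.6; the variable names are illustrative.

```python
# TID sets from Example 6.6.
tids_I1 = {"T100", "T400", "T500", "T700", "T800", "T900"}
tids_I1_I2 = {"T100", "T400", "T800", "T900"}

# diffset({I1, I2}, {I1}): TIDs containing I1 but not {I1, I2},
# i.e., only two TIDs instead of the four in the full TID set.
diffset = tids_I1 - tids_I1_I2
print(sorted(diffset))            # ['T500', 'T700']

# The support of {I1, I2} is recoverable without storing its TID set:
support_I1_I2 = len(tids_I1) - len(diffset)
print(support_I1_I2)              # 4
```

For dense data, diffsets tend to be much shorter than the TID sets themselves, which is the source of the savings the text describes.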





                   6.2.6 Mining Closed and Max Patterns
                         In Section 6.1.2 we saw how frequent itemset mining may generate a huge number of
                         frequent itemsets, especially when the min_sup threshold is set low or when there exist
                         long patterns in the data set. Example 6.2 showed that closed frequent itemsets⁹ can
                         substantially reduce the number of patterns generated in frequent itemset mining while
                         preserving the complete information regarding the set of frequent itemsets. That is, from
                         the set of closed frequent itemsets, we can easily derive the set of frequent itemsets and
                         their support. Thus, in practice it is often more desirable to mine the set of closed
                         frequent itemsets than the set of all frequent itemsets.
                           “How can we mine closed frequent itemsets?” A naïve approach would be to first mine
                         the complete set of frequent itemsets and then remove every frequent itemset that is a
                         proper subset of, and carries the same support as, an existing frequent itemset. However,
                         this is quite costly. As shown in Example 6.2, this method would have to first derive
                         2^100 − 1 frequent itemsets to obtain a length-100 frequent itemset, all before it could
                         begin to eliminate redundant itemsets. This is prohibitively expensive. In fact, there exist
                         only a very small number of closed frequent itemsets in Example 6.2’s data set.
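The naive post-processing approach can be sketched as follows. This is a minimal illustration of why it works (not of why it is costly): the support counts are taken from the chapter's AllElectronics transactions, and `closed_itemsets` is an illustrative helper name, not from the text.

```python
# All frequent itemsets with their support counts (AllElectronics data,
# min_sup = 2; only the itemsets involving I1, I2, I5 are shown here).
frequent = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}

def closed_itemsets(frequent):
    """Remove every itemset that is a proper subset of, and carries the
    same support as, another frequent itemset."""
    return {
        x: sup for x, sup in frequent.items()
        if not any(x < y and sup == frequent[y] for y in frequent)
    }

closed = closed_itemsets(frequent)
for itemset, sup in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), sup)
```

Here {I5}, {I1, I5}, and {I2, I5} are all absorbed by {I1, I2, I5}, which has the same support count of 2. The cost problem is that `frequent` must be fully materialized first, which is infeasible when it contains on the order of 2^100 itemsets.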
                           A recommended methodology is to search for closed frequent itemsets directly dur-
                         ing the mining process. This requires us to prune the search space as soon as we


                         9 Remember that X is a closed frequent itemset in a data set S if there exists no proper super-itemset Y
                         such that Y has the same support count as X in S, and X satisfies minimum support.