256   Chapter 6 Mining Frequent Patterns, Associations, and Correlations



Figure 6.6 Mining by partitioning the data. Phase I: divide the transactions in D into n partitions and find the itemsets frequent local to each partition (one scan). Phase II: combine all the local frequent itemsets to form the candidate itemsets, then find the global frequent itemsets in D among the candidates (one scan).

                          a second scan of D is conducted in which the actual support of each candidate is
                          assessed to determine the global frequent itemsets. Partition size and the number of
                          partitions are set so that each partition can fit into main memory and therefore be
                          read only once in each phase.
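The two phases above can be sketched as follows. This is a toy illustration, not a production miner: the brute-force `frequent_itemsets` helper stands in for a real level-wise algorithm run on each in-memory partition, and the function and parameter names are chosen for this example. The key property exploited in Phase I is that any itemset frequent in D must be locally frequent in at least one partition, so the union of local frequent itemsets cannot miss a global frequent itemset.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Brute-force counting of every itemset in a partition (toy helper;
    a real implementation would run a level-wise miner here)."""
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for combo in combinations(sorted(t), k):
                counts[combo] = counts.get(combo, 0) + 1
    return {iset for iset, c in counts.items() if c >= min_count}

def partition_mine(D, n_partitions, min_sup):
    """Phase I: mine each partition locally; Phase II: verify the union of
    local frequent itemsets against the whole database."""
    size = -(-len(D) // n_partitions)  # ceiling division
    candidates = set()
    # Phase I (one scan): any itemset frequent in D is locally frequent
    # in at least one partition, so no global frequent itemset is missed.
    for start in range(0, len(D), size):
        part = D[start:start + size]
        candidates |= frequent_itemsets(part, min_sup * len(part))
    # Phase II (one scan): compute the actual support of each candidate.
    counts = dict.fromkeys(candidates, 0)
    for t in D:
        ts = set(t)
        for c in candidates:
            if ts.issuperset(c):
                counts[c] += 1
    return {c for c, cnt in counts.items() if cnt >= min_sup * len(D)}
```

Note that Phase I may emit candidates that are only locally frequent; Phase II's full-database count is what filters them out.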
                         Sampling (mining on a subset of the given data): The basic idea of the sampling
                          approach is to pick a random sample S of the given data D, and then search for
                          frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy
against efficiency. The size of S is chosen so that the search for frequent itemsets in S
                          can be done in main memory, and so only one scan of the transactions in S is required
                          overall. Because we are searching for frequent itemsets in S rather than in D, it is
                          possible that we will miss some of the global frequent itemsets.
To reduce this possibility, we use a lower support threshold than minimum support
to find the frequent itemsets local to S (denoted L^S). The rest of the database is
then used to compute the actual frequencies of each itemset in L^S. A mechanism is
used to determine whether all the global frequent itemsets are included in L^S. If L^S
actually contains all the frequent itemsets in D, then only one scan of D is required.
                          Otherwise, a second pass can be done to find the frequent itemsets that were missed
in the first pass. The sampling approach is especially beneficial when efficiency is of
utmost importance, such as in computationally intensive applications that must be
run frequently.
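A minimal sketch of the sampling approach, under illustrative assumptions: `sample_frac` and `lowered_factor` are parameters invented for this example, and the brute-force `local_frequent_itemsets` helper stands in for a real miner run on the in-memory sample. The verification pass against D guarantees every returned itemset is truly frequent, but itemsets missing from L^S can only be recovered by a second pass, as described above.

```python
import random
from itertools import combinations

def local_frequent_itemsets(transactions, min_count):
    """Brute-force itemset counting in the (in-memory) sample."""
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for combo in combinations(sorted(t), k):
                counts[combo] = counts.get(combo, 0) + 1
    return {iset for iset, c in counts.items() if c >= min_count}

def sample_mine(D, min_sup, sample_frac=0.5, lowered_factor=0.75, seed=0):
    """Mine a random sample S at a lowered threshold, then count each itemset
    of L^S against the full database in a single scan of D."""
    rng = random.Random(seed)
    S = [t for t in D if rng.random() < sample_frac] or D[:1]
    # Lowered threshold: reduces the chance of missing a global frequent itemset.
    local_min = max(1, int(lowered_factor * min_sup * len(S)))
    LS = local_frequent_itemsets(S, local_min)  # L^S, frequent itemsets local to S
    counts = dict.fromkeys(LS, 0)
    for t in D:                                 # one scan of D
        ts = set(t)
        for c in LS:
            if ts.issuperset(c):
                counts[c] += 1
    # Itemsets outside L^S may still be globally frequent; recovering any
    # such missed itemsets would take a second pass over D.
    return {c for c, cnt in counts.items() if cnt >= min_sup * len(D)}
```

Depending on which transactions land in the sample, a globally frequent itemset (e.g., one appearing in none of the sampled transactions) can be absent from the result, which is exactly the miss the second pass exists to catch.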
                         Dynamic itemset counting (adding candidate itemsets at different points during a
                          scan): A dynamic itemset counting technique was proposed in which the database
                          is partitioned into blocks marked by start points. In this variation, new candidate
                          itemsets can be added at any start point, unlike in Apriori, which determines new
candidate itemsets only immediately before each complete database scan. The technique
uses the count-so-far as a lower bound on the actual count. If the count-so-far
passes the minimum support threshold, the itemset is added to the frequent itemset
collection and can be used to generate longer candidates. This leads to fewer database
scans than with Apriori for finding all the frequent itemsets.
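The start-point mechanism can be sketched as follows. This is a heavily simplified illustration, not the full dynamic itemset counting algorithm (it omits the original technique's per-itemset state bookkeeping, and `dic_mine` and `block_size` are names chosen for the example): each counter records a count-so-far plus how many transactions it still has to see, new counters begin at block boundaries, and an itemset whose count-so-far already passes minimum support immediately spawns longer candidates instead of waiting for the next full scan.

```python
def dic_mine(D, min_sup, block_size=2):
    """Simplified dynamic itemset counting: the scan is split into blocks,
    and new candidate counters may start at any block boundary (start point).
    A counter is complete once it has seen every transaction exactly once."""
    min_count = min_sup * len(D)
    items = sorted({i for t in D for i in t})
    # itemset -> [count_so_far, transactions_still_to_see]
    counters = {(i,): [0, len(D)] for i in items}
    frequent = set()
    pos = 0
    while any(remaining > 0 for _, remaining in counters.values()):
        # Process one block of transactions, wrapping around the database.
        for k in range(block_size):
            t = set(D[(pos + k) % len(D)])
            for iset, state in counters.items():
                if state[1] > 0:
                    state[1] -= 1
                    if t.issuperset(iset):
                        state[0] += 1
        pos = (pos + block_size) % len(D)
        # Start point: an itemset whose count-so-far already passes minimum
        # support generates longer candidates now, rather than waiting for
        # the next complete database scan as Apriori would.
        for iset, (cnt, _) in list(counters.items()):
            if cnt >= min_count and iset not in frequent:
                frequent.add(iset)
                for i in items:
                    ext = tuple(sorted(set(iset) | {i}))
                    if len(ext) == len(iset) + 1 and ext not in counters:
                        counters[ext] = [0, len(D)]
    return {iset for iset, (cnt, _) in counters.items() if cnt >= min_count}
```

Because a counter started mid-scan keeps counting through the wrap-around until it has seen all of D, longer itemsets finish counting within roughly one extra pass instead of each requiring its own full scan.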

                         Other variations are discussed in the next chapter.