256 Chapter 6 Mining Frequent Patterns, Associations, and Correlations
Figure 6.6 Mining by partitioning the data. Phase I: divide the transactions in D into n partitions and find the itemsets frequent locally within each partition (1 scan); combine all local frequent itemsets to form the candidate itemsets. Phase II: find the global frequent itemsets in D among the candidates (1 scan).
a second scan of D is conducted in which the actual support of each candidate is
assessed to determine the global frequent itemsets. Partition size and the number of
partitions are set so that each partition can fit into main memory and therefore be
read only once in each phase.
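The two-phase partitioning idea can be sketched as follows. This is a toy illustration, not the book's algorithm: the helper `frequent_itemsets` uses brute-force subset enumeration for brevity, and `min_sup` is assumed to be a fractional support threshold applied uniformly to each partition and to the whole database.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Naive local miner: itemsets whose fractional support in
    `transactions` meets `min_sup` (brute-force, for illustration)."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for items in combinations(sorted(t), k):
                counts[items] = counts.get(items, 0) + 1
    return {s for s, c in counts.items() if c / n >= min_sup}

def partition_mine(db, n_parts, min_sup):
    # Phase I: mine each partition with the same fractional threshold.
    # Any globally frequent itemset must be locally frequent in at
    # least one partition, so the union of local results is a complete
    # candidate set (1 scan of db).
    size = (len(db) + n_parts - 1) // n_parts
    candidates = set()
    for i in range(0, len(db), size):
        candidates |= frequent_itemsets(db[i:i + size], min_sup)
    # Phase II: a second scan of db counts the actual support of each
    # candidate, keeping only the globally frequent itemsets.
    counts = {c: 0 for c in candidates}
    for t in db:
        ts = set(t)
        for c in candidates:
            if ts >= set(c):
                counts[c] += 1
    return {c for c in candidates if counts[c] / len(db) >= min_sup}
```

Note that Phase I can only over-generate: an itemset frequent nowhere locally is pruned before the second scan ever counts it.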
Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy
against efficiency. The size of S is chosen so that the search for frequent itemsets in S
can be done in main memory, and so only one scan of the transactions in S is required
overall. Because we are searching for frequent itemsets in S rather than in D, it is
possible that we will miss some of the global frequent itemsets.
To reduce this possibility, we use a support threshold lower than the minimum support
to find the frequent itemsets local to S (denoted L^S). The rest of the database is
then used to compute the actual frequencies of each itemset in L^S. A mechanism is
used to determine whether all the global frequent itemsets are included in L^S. If L^S
actually contains all the frequent itemsets in D, then only one scan of D is required.
Otherwise, a second pass can be done to find the frequent itemsets that were missed
in the first pass. The sampling approach is especially beneficial when efficiency is of
utmost importance, such as in computationally intensive applications that must be
run frequently.
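A minimal sketch of the sampling approach, under stated assumptions: the local miner `frequent_itemsets` is a naive brute-force stand-in, and the sample fraction and threshold-lowering factor (`sample_frac`, `lower_factor`) are illustrative parameters, not values from the text. A real implementation would bound the probability of missing itemsets and add the second corrective pass described above.

```python
import random
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Naive miner: itemsets meeting fractional support min_sup."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for items in combinations(sorted(t), k):
                counts[items] = counts.get(items, 0) + 1
    return {s for s, c in counts.items() if c / n >= min_sup}

def sample_mine(db, min_sup, sample_frac=0.5, lower_factor=0.8, seed=0):
    rng = random.Random(seed)
    s = rng.sample(db, max(1, int(len(db) * sample_frac)))
    # Mine the sample S in memory with a lowered threshold, reducing
    # the chance of missing a globally frequent itemset (this is L^S).
    local = frequent_itemsets(s, min_sup * lower_factor)
    # One scan of the full database verifies the actual supports.
    counts = {c: 0 for c in local}
    for t in db:
        ts = set(t)
        for c in local:
            if ts >= set(c):
                counts[c] += 1
    return {c for c in local if counts[c] / len(db) >= min_sup}
```

Lowering the threshold only widens L^S; the verification scan against D then discards any sample-only artifacts, so the trade-off is extra candidates to count, not false positives in the result.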
Dynamic itemset counting (adding candidate itemsets at different points during a
scan): A dynamic itemset counting technique was proposed in which the database
is partitioned into blocks marked by start points. In this variation, new candidate
itemsets can be added at any start point, unlike in Apriori, which determines new
candidate itemsets only immediately before each complete database scan. The technique uses the count-so-far as a lower bound of the actual count. If the count-so-far
passes the minimum support threshold, the itemset is added to the frequent itemset collection
and can be used to generate longer candidates. This leads to fewer database scans than
with Apriori for finding all the frequent itemsets.
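The block-boundary idea can be sketched as follows. This is a deliberately simplified illustration, not the published technique: `min_count` (an absolute support count) and `block_size` are assumed parameters, and the candidate-generation step at each boundary is crude. The key behavior it shows is that a candidate spawned mid-scan starts counting at the next block and keeps going until it has seen every block exactly once.

```python
def dic_mine(db, min_count, block_size):
    """Simplified dynamic-itemset-counting sketch: new candidates are
    added at block boundaries as soon as an itemset's count-so-far
    passes min_count, instead of waiting for a full database scan."""
    blocks = [db[i:i + block_size] for i in range(0, len(db), block_size)]
    nb = len(blocks)
    # Per-candidate state: [count_so_far, blocks_seen].
    state = {(x,): [0, 0] for x in sorted({i for t in db for i in t})}
    b = 0
    while any(seen < nb for cnt, seen in state.values()):
        block = blocks[b % nb]
        for cand, cs in state.items():
            if cs[1] < nb:  # candidate has not yet seen the whole database
                cs[0] += sum(1 for t in block if set(cand) <= set(t))
                cs[1] += 1
        # Block boundary: itemsets whose count-so-far (a lower bound on
        # the true count) already passes min_count spawn longer
        # candidates, which begin counting from the next block.
        hot = [c for c, cs in state.items() if cs[0] >= min_count]
        new = set()
        for x in hot:
            for y in hot:
                u = tuple(sorted(set(x) | set(y)))
                if len(u) == len(x) + 1 and u not in state:
                    new.add(u)
        for u in new:
            state[u] = [0, 0]
        b += 1
    return {c for c, cs in state.items() if cs[0] >= min_count}
```

Because each candidate's count is finalized once it has cycled through all blocks, the total work amounts to far fewer passes over the data than restarting a full Apriori scan per candidate level.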
Other variations are discussed in the next chapter.