256 Chapter 6 Mining Frequent Patterns, Associations, and Correlations
Figure 6.6 Mining by partitioning the data. Phase I: divide the transactions in D into n partitions and find the itemsets frequent locally within each partition (1 scan); combine all local frequent itemsets to form the candidate itemsets. Phase II: find the global frequent itemsets in D among the candidates (1 scan).
a second scan of D is conducted in which the actual support of each candidate is
assessed to determine the global frequent itemsets. Partition size and the number of
partitions are set so that each partition can fit into main memory and therefore be
read only once in each phase.
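The two-phase partitioning idea can be sketched as follows. This is a toy illustration, not the book's algorithm: the helper `frequent_itemsets` uses brute-force subset enumeration for brevity, and `min_sup` is assumed to be a fractional support threshold applied uniformly to each partition and to the whole database.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Naive local miner: itemsets whose fractional support in
    `transactions` meets `min_sup` (brute-force, for illustration)."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for items in combinations(sorted(t), k):
                counts[items] = counts.get(items, 0) + 1
    return {s for s, c in counts.items() if c / n >= min_sup}

def partition_mine(db, n_parts, min_sup):
    # Phase I: mine each partition with the same fractional threshold.
    # Any globally frequent itemset must be locally frequent in at
    # least one partition, so the union of local results is a complete
    # candidate set (1 scan of db).
    size = (len(db) + n_parts - 1) // n_parts
    candidates = set()
    for i in range(0, len(db), size):
        candidates |= frequent_itemsets(db[i:i + size], min_sup)
    # Phase II: a second scan of db counts the actual support of each
    # candidate, keeping only the globally frequent itemsets.
    counts = {c: 0 for c in candidates}
    for t in db:
        ts = set(t)
        for c in candidates:
            if ts >= set(c):
                counts[c] += 1
    return {c for c in candidates if counts[c] / len(db) >= min_sup}
```

Note that Phase I can only over-generate: an itemset frequent nowhere locally is pruned before the second scan ever counts it.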
Sampling (mining on a subset of the given data): The basic idea of the sampling
approach is to pick a random sample S of the given data D, and then search for
frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy
against efficiency. The size of S is chosen so that the search for frequent itemsets in S
can be done in main memory, and so only one scan of the transactions in S is required
overall. Because we are searching for frequent itemsets in S rather than in D, it is
possible that we will miss some of the global frequent itemsets.
To reduce this possibility, we use a support threshold lower than the minimum support
to find the frequent itemsets local to S (denoted L^S). The rest of the database is
then used to compute the actual frequencies of each itemset in L^S. A mechanism is
used to determine whether all the global frequent itemsets are included in L^S. If L^S
actually contains all the frequent itemsets in D, then only one scan of D is required.
Otherwise, a second pass can be done to find the frequent itemsets that were missed
in the first pass. The sampling approach is especially beneficial when efficiency is of
utmost importance, such as in computationally intensive applications that must be
run frequently.
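A minimal sketch of the sampling approach, under stated assumptions: the local miner `frequent_itemsets` is a naive brute-force stand-in, and the sample fraction and threshold-lowering factor (`sample_frac`, `lower_factor`) are illustrative parameters, not values from the text. A real implementation would bound the probability of missing itemsets and add the second corrective pass described above.

```python
import random
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Naive miner: itemsets meeting fractional support min_sup."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for k in range(1, len(t) + 1):
            for items in combinations(sorted(t), k):
                counts[items] = counts.get(items, 0) + 1
    return {s for s, c in counts.items() if c / n >= min_sup}

def sample_mine(db, min_sup, sample_frac=0.5, lower_factor=0.8, seed=0):
    rng = random.Random(seed)
    s = rng.sample(db, max(1, int(len(db) * sample_frac)))
    # Mine the sample S in memory with a lowered threshold, reducing
    # the chance of missing a globally frequent itemset (this is L^S).
    local = frequent_itemsets(s, min_sup * lower_factor)
    # One scan of the full database verifies the actual supports.
    counts = {c: 0 for c in local}
    for t in db:
        ts = set(t)
        for c in local:
            if ts >= set(c):
                counts[c] += 1
    return {c for c in local if counts[c] / len(db) >= min_sup}
```

Lowering the threshold only widens L^S; the verification scan against D then discards any sample-only artifacts, so the trade-off is extra candidates to count, not false positives in the result.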
Dynamic itemset counting (adding candidate itemsets at different points during a
scan): A dynamic itemset counting technique was proposed in which the database
is partitioned into blocks marked by start points. In this variation, new candidate
itemsets can be added at any start point, unlike in Apriori, which determines new
candidate itemsets only immediately before each complete database scan. The technique uses the count-so-far as a lower bound of the actual count. If the count-so-far
passes the minimum support threshold, the itemset is added to the frequent itemset collection
and can be used to generate longer candidates. This leads to fewer database scans than
with Apriori for finding all the frequent itemsets.
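The block-boundary idea can be sketched as follows. This is a deliberately simplified illustration, not the published technique: `min_count` (an absolute support count) and `block_size` are assumed parameters, and the candidate-generation step at each boundary is crude. The key behavior it shows is that a candidate spawned mid-scan starts counting at the next block and keeps going until it has seen every block exactly once.

```python
def dic_mine(db, min_count, block_size):
    """Simplified dynamic-itemset-counting sketch: new candidates are
    added at block boundaries as soon as an itemset's count-so-far
    passes min_count, instead of waiting for a full database scan."""
    blocks = [db[i:i + block_size] for i in range(0, len(db), block_size)]
    nb = len(blocks)
    # Per-candidate state: [count_so_far, blocks_seen].
    state = {(x,): [0, 0] for x in sorted({i for t in db for i in t})}
    b = 0
    while any(seen < nb for cnt, seen in state.values()):
        block = blocks[b % nb]
        for cand, cs in state.items():
            if cs[1] < nb:  # candidate has not yet seen the whole database
                cs[0] += sum(1 for t in block if set(cand) <= set(t))
                cs[1] += 1
        # Block boundary: itemsets whose count-so-far (a lower bound on
        # the true count) already passes min_count spawn longer
        # candidates, which begin counting from the next block.
        hot = [c for c, cs in state.items() if cs[0] >= min_count]
        new = set()
        for x in hot:
            for y in hot:
                u = tuple(sorted(set(x) | set(y)))
                if len(u) == len(x) + 1 and u not in state:
                    new.add(u)
        for u in new:
            state[u] = [0, 0]
        b += 1
    return {c for c, cs in state.items() if cs[0] >= min_count}
```

Because each candidate's count is finalized once it has cycled through all blocks, the total work amounts to far fewer passes over the data than restarting a full Apriori scan per candidate level.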
Other variations are discussed in the next chapter.