Page 294 -
P. 294
3:20 Page 257
#15
13-ch06-243-278-9780123814791
2011/6/1
HAN
6.2 Frequent Itemset Mining Methods 257
6.2.4 A Pattern-Growth Approach for Mining
Frequent Itemsets
As we have seen, in many cases the Apriori candidate generate-and-test method signifi-
cantly reduces the size of candidate sets, leading to good performance gain. However, it
can suffer from two nontrivial costs:
It may still need to generate a huge number of candidate sets. For example, if there are
4
10 frequent 1-itemsets, the Apriori algorithm will need to generate more than 10 7
candidate 2-itemsets.
It may need to repeatedly scan the whole database and check a large set of candidates by
pattern matching. It is costly to go over each transaction in the database to determine
the support of the candidate itemsets.
“Can we design a method that mines the complete set of frequent itemsets without such
a costly candidate generation process?” An interesting method in this attempt is called
frequent pattern growth, or simply FP-growth, which adopts a divide-and-conquer
strategy as follows. First, it compresses the database representing frequent items into a
frequent pattern tree, or FP-tree, which retains the itemset association information. It
then divides the compressed database into a set of conditional databases (a special kind of
projected database), each associated with one frequent item or “pattern fragment,” and
mines each database separately. For each “pattern fragment,” only its associated data sets
need to be examined. Therefore, this approach may substantially reduce the size of the
data sets to be searched, along with the “growth” of patterns being examined. You will
see how it works in Example 6.5.
Example 6.5 FP-growth (finding frequent itemsets without candidate generation). We reexamine
the mining of transaction database, D, of Table 6.1 in Example 6.3 using the frequent
pattern growth approach.
The first scan of the database is the same as Apriori, which derives the set of frequent
items (1-itemsets) and their support counts (frequencies). Let the minimum support
count be 2. The set of frequent items is sorted in the order of descending support count.
This resulting set or list is denoted by L. Thus, we have L = {{I2: 7}, {I1: 6}, {I3: 6},
{I4: 2}, {I5: 2}}.
An FP-tree is then constructed as follows. First, create the root of the tree, labeled
with “null.” Scan database D a second time. The items in each transaction are processed
in L order (i.e., sorted according to descending support count), and a branch is created
for each transaction. For example, the scan of the first transaction, “T100: I1, I2, I5,”
which contains three items (I2, I1, I5 in L order), leads to the construction of the first
branch of the tree with three nodes, hI2: 1i, hI1: 1i, and hI5: 1i, where I2 is linked as a
child to the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200,
contains the items I2 and I4 in L order, which would result in a branch where I2 is linked
to the root and I4 is linked to I2. However, this branch would share a common prefix,
I2, with the existing path for T100. Therefore, we instead increment the count of the I2
node by 1, and create a new node, hI4: 1i, which is linked as a child to hI2: 2i. In general,