248   Chapter 6 Mining Frequent Patterns, Associations, and Correlations


                         Thus, we say that C contains complete information regarding its corresponding frequent
                         itemsets. On the other hand, M registers only the support of the maximal itemsets. It
                         usually does not contain the complete support information regarding its corresponding
                         frequent itemsets. We illustrate these concepts with Example 6.2.

            Example 6.2 Closed and maximal frequent itemsets. Suppose that a transaction database has only
                         two transactions: {⟨a_1, a_2, ..., a_100⟩; ⟨a_1, a_2, ..., a_50⟩}. Let the minimum support count
                         threshold be min_sup = 1. We find two closed frequent itemsets and their support
                         counts, that is, C = {{a_1, a_2, ..., a_100} : 1; {a_1, a_2, ..., a_50} : 2}. There is only one max-
                         imal frequent itemset: M = {{a_1, a_2, ..., a_100} : 1}. Notice that we cannot include
                         {a_1, a_2, ..., a_50} as a maximal frequent itemset because it has a frequent superset,
                         {a_1, a_2, ..., a_100}. Compare this to the preceding, where we determined that there are
                         2^100 − 1 frequent itemsets, which are too many to be enumerated!
                           The set of closed frequent itemsets contains complete information regarding the fre-
                         quent itemsets. For example, from C, we can derive, say, (1) {a_2, a_45 : 2}, since {a_2, a_45} is
                         a sub-itemset of the itemset {a_1, a_2, ..., a_50 : 2}; and (2) {a_8, a_55 : 1}, since {a_8, a_55} is not
                         a sub-itemset of the previous itemset, but of the itemset {a_1, a_2, ..., a_100 : 1}. However,
                         from the maximal frequent itemset, we can only assert that both itemsets ({a_2, a_45} and
                         {a_8, a_55}) are frequent, but we cannot assert their actual support counts.
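The derivation just described can be sketched in code. The following is a minimal illustration (not from the text itself): it encodes the two closed itemsets of Example 6.2 and recovers the support of any frequent itemset as the maximum support over its closed supersets. The names `C` and `support` are illustrative choices, not part of any standard library.

```python
# Closed frequent itemsets from Example 6.2, mapped to their support counts.
# {a1, ..., a100} has support 1; {a1, ..., a50} has support 2.
C = {
    frozenset(f"a{i}" for i in range(1, 101)): 1,
    frozenset(f"a{i}" for i in range(1, 51)): 2,
}

def support(itemset, closed):
    """Support of an itemset = maximum support over the closed itemsets
    that contain it; 0 (i.e., not frequent) if no closed superset exists."""
    counts = [cnt for c, cnt in closed.items() if itemset <= c]
    return max(counts, default=0)

print(support(frozenset({"a2", "a45"}), C))  # 2: contained in {a1..a50}
print(support(frozenset({"a8", "a55"}), C))  # 1: contained only in {a1..a100}
```

Note that the same lookup against the maximal set M = {{a_1, ..., a_100} : 1} alone would report support 1 for both queries, which is exactly the information loss the text describes.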



                 6.2     Frequent Itemset Mining Methods


                         In this section, you will learn methods for mining the simplest form of frequent
                         patterns, such as those discussed for market basket analysis in Section 6.1.1. We begin by
                         presenting Apriori, the basic algorithm for finding frequent itemsets (Section 6.2.1). In
                         Section 6.2.2, we look at how to generate strong association rules from frequent item-
                         sets. Section 6.2.3 describes several variations to the Apriori algorithm for improved
                         efficiency and scalability. Section 6.2.4 presents pattern-growth methods for mining
                         frequent itemsets that confine the subsequent search space to only the data sets contain-
                         ing the current frequent itemsets. Section 6.2.5 presents methods for mining frequent
                         itemsets that take advantage of the vertical data format.


                   6.2.1 Apriori Algorithm: Finding Frequent Itemsets
                         by Confined Candidate Generation
                         Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for min-
                         ing frequent itemsets for Boolean association rules [AS94b]. The name of the algorithm
                         is based on the fact that the algorithm uses prior knowledge of frequent itemset prop-
                         erties, as we shall see later. Apriori employs an iterative approach known as a level-wise
                         search, where k-itemsets are used to explore (k + 1)-itemsets. First, the set of frequent
                         1-itemsets is found by scanning the database to accumulate the count for each item, and