Page 311 -
P. 311

13-ch06-243-278-9780123814791
                                                            2011/6/1
                         HAN
          274   Chapter 6 Mining Frequent Patterns, Associations, and Correlations  3:20 Page 274  #32



                                                   TID    items bought
                                                   T100   {M, O, N, K, E, Y}
                                                   T200   {D, O, N, K, E, Y }
                                                   T300   {M, A, K, E}
                                                   T400   {M, U, C, K, Y}
                                                   T500   {C, O, O, K, I, E}

                             (a) Find all frequent itemsets using Apriori and FP-growth, respectively. Compare
                                the efficiency of the two mining processes.
                            (b) List all the strong association rules (with support s and confidence c) matching
                                the following metarule, where X is a variable representing customers, and item i
                                denotes variables representing items (e.g., “A,” “B,”):

                                 ∀x ∈ transaction, buys(X,item 1 ) ∧ buys(X,item 2 ) ⇒ buys(X,item 3 ) [s,c]
                         6.7 (Implementation project) Using a programming language that you are familiar
                             with, such as C++ or Java, implement three frequent itemset mining algorithms
                             introduced in this chapter: (1) Apriori [AS94b], (2) FP-growth [HPY00], and
                             (3) Eclat [Zak00] (mining using the vertical data format). Compare the perfor-
                             mance of each algorithm with various kinds of large data sets. Write a report to
                             analyze the situations (e.g., data size, data distribution, minimal support thresh-
                             old setting, and pattern density) where one algorithm may perform better than the
                             others, and state why.

                         6.8 A database has four transactions. Let min sup = 60% and min conf = 80%.

                          cust ID TID items bought (in the form of brand-item category)
                          01     T100 {King’s-Crab, Sunset-Milk, Dairyland-Cheese, Best-Bread}
                          02     T200 {Best-Cheese, Dairyland-Milk, Goldenfarm-Apple, Tasty-Pie, Wonder-Bread}
                          01     T300 {Westcoast-Apple, Dairyland-Milk, Wonder-Bread, Tasty-Pie}
                          03     T400 {Wonder-Bread, Sunset-Milk, Dairyland-Cheese}


                           (a) At the granularity of item category (e.g., item i could be “Milk”), for the rule
                               template,

                                ∀X ∈ transaction, buys(X,item 1 ) ∧ buys(X,item 2 ) ⇒ buys(X,item 3 ) [s,c],
                               list the frequent k-itemset for the largest k, and all the strong association rules
                               (with their support s and confidence c) containing the frequent k-itemset for the
                               largest k.
                           (b) At the granularity of brand-item category (e.g., item i could be “Sunset-Milk”),
                               for the rule template,

                                    ∀X ∈ customer, buys(X,item 1 ) ∧ buys(X,item 2 ) ⇒ buys(X,item 3 ),

                               list the frequent k-itemset for the largest k (but do not print any rules).
   306   307   308   309   310   311   312   313   314   315   316