Page 283 -
P. 283

HAN 13-ch06-243-278-9780123814791
                                                             2011/6/1

          246   Chapter 6 Mining Frequent Patterns, Associations, and Correlations  3:20  Page 246  #4



                   6.1.2 Frequent Itemsets, Closed Itemsets,
                         and Association Rules
                         Let I = {I 1 , I 2 ,..., I m } be an itemset. Let D, the task-relevant data, be a set of database
                         transactions where each transaction T is a nonempty itemset such that T ⊆ I. Each
                         transaction is associated with an identifier, called a TID. Let A be a set of items. A trans-
                         action T is said to contain A if A ⊆ T. An association rule is an implication of the form
                         A ⇒ B, where A ⊂ I, B ⊂ I, A 6= ∅, B 6= ∅, and A ∩ B = φ. The rule A ⇒ B holds in the
                         transaction set D with support s, where s is the percentage of transactions in D that
                         contain A ∪ B (i.e., the union of sets A and B say, or, both A and B). This is taken to be
                                             1
                         the probability, P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D,
                         where c is the percentage of transactions in D containing A that also contain B. This is
                         taken to be the conditional probability, P(B|A). That is,


                                                  support(A⇒B) =P(A ∪ B)                  (6.2)
                                                confidence(A⇒B) =P(B|A).                   (6.3)

                         Rules that satisfy both a minimum support threshold (min sup) and a minimum con-
                         fidence threshold (min conf ) are called strong. By convention, we write support and
                         confidence values so as to occur between 0% and 100%, rather than 0 to 1.0.
                                                              2
                           A set of items is referred to as an itemset. An itemset that contains k items is a
                         k-itemset. The set {computer, antivirus software} is a 2-itemset. The occurrence fre-
                         quency of an itemset is the number of transactions that contain the itemset. This is
                         also known, simply, as the frequency, support count, or count of the itemset. Note
                         that the itemset support defined in Eq. (6.2) is sometimes referred to as relative support,
                         whereas the occurrence frequency is called the absolute support. If the relative support
                         of an itemset I satisfies a prespecified minimum support threshold (i.e., the absolute
                         support of I satisfies the corresponding minimum support count threshold), then I is
                                        3
                         a frequent itemset. The set of frequent k-itemsets is commonly denoted by L k . 4
                           From Eq. (6.3), we have


                                                      support(A ∪ B)  support count(A ∪ B)
                             confidence(A⇒B) = P(B|A) =            =                   .   (6.4)
                                                       support(A)     support count(A)

                         1 Notice that the notation P(A ∪ B) indicates the probability that a transaction contains the union of sets
                         A and B (i.e., it contains every item in A and B). This should not be confused with P(A or B), which
                         indicates the probability that a transaction contains either A or B.
                         2 In the data mining research literature, “itemset” is more commonly used than “item set.”
                         3 In early work, itemsets satisfying minimum support were referred to as large. This term, however,
                         is somewhat confusing as it has connotations of the number of items in an itemset rather than the
                         frequency of occurrence of the set. Hence, we use the more recent term frequent.
                         4 Although the term frequent is preferred over large, for historic reasons frequent k-itemsets are still
                         denoted as L k .
   278   279   280   281   282   283   284   285   286   287   288