Chapter 6 Mining Frequent Patterns, Associations, and Correlations



                         four such measures: all_confidence, max_confidence, Kulczynski, and cosine. We’ll then
                         compare their effectiveness with respect to one another and with respect to the lift and
                         χ² measures.
                           Given two itemsets, A and B, the all_confidence measure of A and B is defined as

                         \[
                         \mathit{all\_conf}(A,B) = \frac{\mathit{sup}(A \cup B)}{\max\{\mathit{sup}(A),\, \mathit{sup}(B)\}} = \min\{P(A|B),\, P(B|A)\}, \tag{6.9}
                         \]

                         where max{sup(A), sup(B)} is the maximum support of the itemsets A and B. Thus,
                         all_conf(A,B) is also the minimum confidence of the two association rules related to
                         A and B, namely, “A ⇒ B” and “B ⇒ A.”
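
As a minimal sketch of Eq. (6.9), the following Python snippet computes all_conf from three support counts and checks that it coincides with the smaller of the two rule confidences. The function name `all_conf` and the support counts are illustrative assumptions, not values from the text.

```python
def all_conf(sup_a, sup_b, sup_ab):
    """all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)}  (Eq. 6.9)."""
    return sup_ab / max(sup_a, sup_b)

# Hypothetical support counts, chosen only for illustration.
sup_a, sup_b, sup_ab = 1000, 400, 200

conf_a_to_b = sup_ab / sup_a   # P(B|A), confidence of the rule A => B
conf_b_to_a = sup_ab / sup_b   # P(A|B), confidence of the rule B => A

value = all_conf(sup_a, sup_b, sup_ab)
print(value)                                   # 0.2
print(value == min(conf_a_to_b, conf_b_to_a))  # True
```

Dividing by the larger of the two supports yields the smaller of the two ratios, which is why the support form and the min-of-confidences form agree.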
                           Given two itemsets, A and B, the max_confidence measure of A and B is defined as

                         \[
                         \mathit{max\_conf}(A,B) = \max\{P(A|B),\, P(B|A)\}. \tag{6.10}
                         \]

                         The max_conf measure is the maximum confidence of the two association rules,
                         “A ⇒ B” and “B ⇒ A.”
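
To illustrate Eq. (6.10) on raw data, the sketch below counts supports in a small list of transactions and then takes the larger of the two confidences. The transactions, the item names, and the helper names `support_count` and `max_conf` are all hypothetical, and the code assumes both itemsets occur at least once so that no division by zero arises.

```python
# Hypothetical toy transactions, invented for illustration only.
transactions = [
    {"milk", "coffee"},
    {"milk", "coffee"},
    {"milk"},
    {"milk"},
    {"milk", "bread"},
    {"bread"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def max_conf(a, b, transactions):
    """max_conf(A, B) = max{P(A|B), P(B|A)}  (Eq. 6.10)."""
    sup_a = support_count(a, transactions)
    sup_b = support_count(b, transactions)
    sup_ab = support_count(a | b, transactions)  # transactions containing both A and B
    return max(sup_ab / sup_b, sup_ab / sup_a)

print(max_conf({"milk"}, {"coffee"}, transactions))  # 1.0
```

In this toy data every coffee transaction also contains milk, so P(milk|coffee) = 1.0 and max_conf reports 1.0, even though only 2 of the 5 milk transactions contain coffee.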
                           Given two itemsets, A and B, the Kulczynski measure of A and B (abbreviated as
                         Kulc) is defined as

                         \[
                         \mathit{Kulc}(A,B) = \frac{1}{2}\bigl(P(A|B) + P(B|A)\bigr). \tag{6.11}
                         \]

                         It was proposed in 1927 by Polish mathematician S. Kulczynski. It can be viewed as an
                         average of two confidence measures. That is, it is the average of two conditional prob-
                         abilities: the probability of itemset B given itemset A, and the probability of itemset A
                         given itemset B.
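
A short worked example of Eq. (6.11), written as a sketch in Python; the function name `kulc` and the counts are assumptions made for illustration.

```python
def kulc(sup_a, sup_b, sup_ab):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2  (Eq. 6.11)."""
    p_a_given_b = sup_ab / sup_b   # confidence of B => A
    p_b_given_a = sup_ab / sup_a   # confidence of A => B
    return 0.5 * (p_a_given_b + p_b_given_a)

# Hypothetical counts: sup(A) = 100, sup(B) = 50, sup(A ∪ B) = 25.
# P(A|B) = 25/50 = 0.5 and P(B|A) = 25/100 = 0.25, so Kulc = 0.375.
print(kulc(100, 50, 25))  # 0.375
```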
                           Finally, given two itemsets, A and B, the cosine measure of A and B is defined as

                         \[
                         \mathit{cosine}(A,B) = \frac{P(A \cup B)}{\sqrt{P(A) \times P(B)}} = \frac{\mathit{sup}(A \cup B)}{\sqrt{\mathit{sup}(A) \times \mathit{sup}(B)}} = \sqrt{P(A|B) \times P(B|A)}. \tag{6.12}
                         \]

                         The cosine measure can be viewed as a harmonized lift measure: The two formulae are
                         similar except that for cosine, the square root is taken on the product of the probabilities
                         of A and B. This is an important difference, however, because by taking the square root,
                         the cosine value is only influenced by the supports of A, B, and A ∪ B, and not by the
                         total number of transactions.
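
The effect of the square root can be checked numerically. The sketch below holds the three support counts fixed and varies only the total number of transactions n; lift is taken here as P(A ∪ B) / (P(A) × P(B)), and the function names and counts are hypothetical.

```python
import math

def lift(sup_a, sup_b, sup_ab, n):
    """lift(A, B) = P(A ∪ B) / (P(A) * P(B)), with P(X) = sup(X) / n."""
    return (sup_ab / n) / ((sup_a / n) * (sup_b / n))

def cosine(sup_a, sup_b, sup_ab, n):
    """cosine(A, B) = P(A ∪ B) / sqrt(P(A) * P(B))  (Eq. 6.12)."""
    return (sup_ab / n) / math.sqrt((sup_a / n) * (sup_b / n))

# Hypothetical support counts, kept fixed while n grows.
sup_a, sup_b, sup_ab = 100, 100, 50

for n in (200, 2000, 20000):
    print(n,
          round(lift(sup_a, sup_b, sup_ab, n), 4),
          round(cosine(sup_a, sup_b, sup_ab, n), 4))
```

Under these assumed counts, lift grows in proportion to n (about 1, 10, and 100), while cosine stays at about 0.5, because the factor of n cancels once the square root is taken in the denominator.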
                           Each of the four measures defined above has the following property: Its value is
                         influenced only by the supports of A, B, and A ∪ B, or more exactly, by the conditional
                         probabilities P(A|B) and P(B|A), but not by the total number of transactions. Another
                         common property is that each measure ranges from 0 to 1, and the higher the value, the
                         closer the relationship between A and B.
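
Both properties can be seen by writing each measure purely as a function of p = P(A|B) and q = P(B|A), as in the following sketch. The dictionary `MEASURES` and the evaluation grid are illustrative assumptions; the grid is a superset of the (p, q) pairs that real data can produce, so it only serves as a range check.

```python
import math

# Each measure expressed only through p = P(A|B) and q = P(B|A);
# the total number of transactions never appears.
MEASURES = {
    "all_conf": lambda p, q: min(p, q),
    "max_conf": lambda p, q: max(p, q),
    "Kulc":     lambda p, q: 0.5 * (p + q),
    "cosine":   lambda p, q: math.sqrt(p * q),
}

grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

for name, measure in MEASURES.items():
    values = [measure(p, q) for p in grid for q in grid]
    # Every value stays inside [0, 1] on this grid.
    assert all(0.0 <= v <= 1.0 for v in values)
    print(name, min(values), max(values))
```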
                           Now, together with lift and χ², we have introduced in total six pattern evaluation
                         measures. You may wonder, “Which is the best in assessing the discovered pattern rela-
                         tionships?” To answer this question, we examine their performance on some typical
                         data sets.