2011/6/1
HAN
13-ch06-243-278-9780123814791
Chapter 6 Mining Frequent Patterns, Associations, and Correlations
four such measures: all_confidence, max_confidence, Kulczynski, and cosine. We’ll then
compare their effectiveness with respect to one another and with respect to the lift and
$\chi^2$ measures.
Given two itemsets, A and B, the all_confidence measure of A and B is defined as
\[
\mathit{all\_conf}(A,B) = \frac{\mathit{sup}(A \cup B)}{\max\{\mathit{sup}(A), \mathit{sup}(B)\}} = \min\{P(A|B), P(B|A)\}, \tag{6.9}
\]
where max{sup(A), sup(B)} is the maximum support of the itemsets A and B. Thus,
all_conf(A,B) is also the minimum confidence of the two association rules related to
A and B, namely, “A ⇒ B” and “B ⇒ A.”
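As a minimal sketch of Eq. (6.9), the following Python snippet computes all_conf from support counts and checks that it matches the smaller of the two rule confidences. The function name, argument names, and the support counts are assumptions made up for illustration; they do not come from the text.

```python
def all_conf(sup_a: float, sup_b: float, sup_ab: float) -> float:
    """all_conf(A, B) = sup(A ∪ B) / max{sup(A), sup(B)}, as in Eq. (6.9)."""
    return sup_ab / max(sup_a, sup_b)


# Hypothetical support counts, chosen only for illustration.
sup_a, sup_b, sup_ab = 400, 500, 200

conf_a_to_b = sup_ab / sup_a  # P(B|A), confidence of the rule A => B
conf_b_to_a = sup_ab / sup_b  # P(A|B), confidence of the rule B => A

print(all_conf(sup_a, sup_b, sup_ab))   # 0.4
print(min(conf_a_to_b, conf_b_to_a))    # 0.4
```

Dividing by the larger support is what makes the result equal to the smaller confidence, since both confidences share the numerator sup(A ∪ B).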
Given two itemsets, A and B, the max_confidence measure of A and B is defined as
\[
\mathit{max\_conf}(A,B) = \max\{P(A|B), P(B|A)\}. \tag{6.10}
\]
The max_conf measure is the maximum confidence of the two association rules,
“A ⇒ B” and “B ⇒ A.”
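A short Python sketch of Eq. (6.10) follows. The function name and the deliberately skewed support counts are hypothetical; they are meant only to show that max_conf is determined by the stronger of the two rules.

```python
def max_conf(sup_a: float, sup_b: float, sup_ab: float) -> float:
    """max_conf(A, B) = max{P(A|B), P(B|A)}, as in Eq. (6.10)."""
    p_a_given_b = sup_ab / sup_b  # confidence of B => A
    p_b_given_a = sup_ab / sup_a  # confidence of A => B
    return max(p_a_given_b, p_b_given_a)


# Hypothetical, strongly imbalanced supports: A is rare, B is common,
# and every transaction containing A also contains B.
sup_a, sup_b, sup_ab = 100, 10_000, 100

print(max_conf(sup_a, sup_b, sup_ab))        # 1.0
print(sup_ab / min(sup_a, sup_b))            # 1.0, the same value
```

The second print uses the equivalent form sup(A ∪ B) / min{sup(A), sup(B)}, which mirrors the support-based form of all_conf with min in place of max.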
Given two itemsets, A and B, the Kulczynski measure of A and B (abbreviated as
Kulc) is defined as
\[
\mathit{Kulc}(A,B) = \frac{1}{2}\bigl(P(A|B) + P(B|A)\bigr). \tag{6.11}
\]
It was proposed in 1927 by Polish mathematician S. Kulczynski. It can be viewed as an
average of two confidence measures. That is, it is the average of two conditional prob-
abilities: the probability of itemset B given itemset A, and the probability of itemset A
given itemset B.
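The averaging in Eq. (6.11) can be sketched in Python as below. The function name and the two sets of support counts are assumptions for illustration only.

```python
def kulc(sup_a: float, sup_b: float, sup_ab: float) -> float:
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2, as in Eq. (6.11)."""
    p_a_given_b = sup_ab / sup_b
    p_b_given_a = sup_ab / sup_a
    return 0.5 * (p_a_given_b + p_b_given_a)


# Hypothetical balanced case: the two confidences are 0.4 and 0.5.
print(round(kulc(400, 500, 200), 3))      # 0.45

# Hypothetical imbalanced case: the confidences are 0.01 and 1.0,
# so their average lands near the middle of the range.
print(round(kulc(100, 10_000, 100), 3))   # 0.505
```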
Finally, given two itemsets, A and B, the cosine measure of A and B is defined as
\[
\mathit{cosine}(A,B) = \frac{P(A \cup B)}{\sqrt{P(A) \times P(B)}} = \frac{\mathit{sup}(A \cup B)}{\sqrt{\mathit{sup}(A) \times \mathit{sup}(B)}} = \sqrt{P(A|B) \times P(B|A)}. \tag{6.12}
\]
The cosine measure can be viewed as a harmonized lift measure: The two formulae are
similar except that for cosine, the square root is taken on the product of the probabilities
of A and B. This is an important difference, however, because by taking the square root,
the cosine value is only influenced by the supports of A, B, and A ∪ B, and not by the
total number of transactions.
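To illustrate the difference from lift, the sketch below evaluates both measures on the same support counts while varying only the total number of transactions. It assumes the usual definition lift(A, B) = P(A ∪ B) / (P(A) P(B)), which is not restated in this passage; the function names and the counts are hypothetical.

```python
import math


def cosine(sup_a: float, sup_b: float, sup_ab: float) -> float:
    """cosine(A, B) = sup(A ∪ B) / sqrt(sup(A) * sup(B)), as in Eq. (6.12)."""
    return sup_ab / math.sqrt(sup_a * sup_b)


def lift(sup_a: float, sup_b: float, sup_ab: float, n: int) -> float:
    """lift(A, B) = P(A ∪ B) / (P(A) P(B)) with P(X) = sup(X) / n (assumed definition)."""
    # Algebraically simplified form: one factor of n survives in the numerator.
    return sup_ab * n / (sup_a * sup_b)


# Hypothetical support counts; only the total transaction count n changes.
sup_a, sup_b, sup_ab = 400, 500, 200

for n in (1_000, 100_000):
    print(n, lift(sup_a, sup_b, sup_ab, n), round(cosine(sup_a, sup_b, sup_ab), 4))
# 1000 1.0 0.4472
# 100000 100.0 0.4472
```

Under these assumptions, lift grows in proportion to n because one factor of n does not cancel, whereas the square root in cosine removes n entirely.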
Each of these four measures has the following property: Its value is influenced only
by the supports of A, B, and A ∪ B, or more exactly, by the conditional
probabilities P(A|B) and P(B|A), but not by the total number of transactions. Another
common property is that each measure ranges from 0 to 1, and the higher the value, the
closer the relationship between A and B.
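Both properties can be checked on a toy example. The sketch below computes all four measures from a small, invented transaction list, then appends many transactions that contain neither A nor B and recomputes them. The item names, helper names, and data are all hypothetical.

```python
import math


def supports(transactions, a, b):
    """Count the transactions that contain A, B, and all items of both."""
    sup_a = sum(1 for t in transactions if a <= t)
    sup_b = sum(1 for t in transactions if b <= t)
    sup_ab = sum(1 for t in transactions if (a | b) <= t)
    return sup_a, sup_b, sup_ab


def measures(sup_a, sup_b, sup_ab):
    """Return the four measures of Eqs. (6.9)-(6.12) as a dict."""
    p_b_given_a = sup_ab / sup_a
    p_a_given_b = sup_ab / sup_b
    return {
        "all_conf": min(p_a_given_b, p_b_given_a),
        "max_conf": max(p_a_given_b, p_b_given_a),
        "kulc": 0.5 * (p_a_given_b + p_b_given_a),
        "cosine": math.sqrt(p_a_given_b * p_b_given_a),
    }


# Invented itemsets and transactions, for illustration only.
a, b = frozenset({"milk"}), frozenset({"coffee"})
base = [{"milk", "coffee"}, {"milk"}, {"coffee"}, {"coffee", "sugar"}]
padded = base + [{"tea"}] * 1000  # extra transactions with neither A nor B

m_base = measures(*supports(base, a, b))
m_padded = measures(*supports(padded, a, b))

print(m_base == m_padded)                               # True
print(all(0.0 <= v <= 1.0 for v in m_base.values()))    # True
```

The padded data set has 1,004 transactions instead of 4, yet the supports of A, B, and A ∪ B are unchanged, so every measure keeps its value.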
Now, together with lift and $\chi^2$, we have introduced in total six pattern evaluation
measures. You may wonder, “Which is the best in assessing the discovered pattern rela-
tionships?” To answer this question, we examine their performance on some typical
data sets.