Chapter 2 Getting to Know Your Data
similarity function, we have
\[
\mathrm{sim}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}, \tag{2.23}
\]
where $\|x\|$ is the Euclidean norm of vector $x = (x_1, x_2, \ldots, x_p)$, defined as
$\sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}$. Conceptually, it is the length of the vector. Similarly, $\|y\|$ is the
Euclidean norm of vector y. The measure computes the cosine of the angle between
vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each
other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the
angle and the greater the match between vectors. Note that because the cosine similarity
measure does not obey all of the properties of Section 2.4.4 defining metric measures, it
is referred to as a nonmetric measure.
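Eq. (2.23) translates almost directly into code. The following is a minimal Python sketch, assuming the vectors are given as equal-length sequences of numbers. The function name `cosine_similarity` and the choice to raise an error for a zero vector (whose angle is undefined) are illustrative assumptions, not part of the text.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y, following Eq. (2.23)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    dot = sum(a * b for a, b in zip(x, y))        # x . y
    norm_x = math.sqrt(sum(a * a for a in x))     # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))     # ||y||
    if norm_x == 0 or norm_y == 0:
        # Assumption: a zero vector has no direction, so we refuse to return a value.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

For orthogonal vectors such as (1, 0) and (0, 1) the function returns 0, and for vectors pointing in the same direction, such as (1, 2) and (2, 4), it returns a value of approximately 1 (up to floating-point rounding).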
Example 2.23 Cosine similarity between two term-frequency vectors. Suppose that x and y are the
first two term-frequency vectors in Table 2.5. That is, x = (5,0,3,0,2,0,0,2,0,0) and
y = (3,0,2,0,1,1,0,1,0,1). How similar are x and y? Using Eq. (2.23) to compute the
cosine similarity between the two vectors, we get:
\[
\begin{aligned}
x^t \cdot y &= 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 + 0 \times 0 + 0 \times 1 = 25 \\
\|x\| &= \sqrt{5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = 6.48 \\
\|y\| &= \sqrt{3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2} = 4.12 \\
\mathrm{sim}(x, y) &= 0.94
\end{aligned}
\]
Therefore, if we were using the cosine similarity measure to compare these documents,
they would be considered quite similar.
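The arithmetic of Example 2.23 can be checked with a few lines of Python. This is only a sketch for verification; the variable names are illustrative, and the two vectors are the ones quoted in the example.

```python
import math

# Term-frequency vectors x and y from Example 2.23.
x = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
y = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

dot = sum(a * b for a, b in zip(x, y))        # x . y
norm_x = math.sqrt(sum(a * a for a in x))     # ||x|| = sqrt(42)
norm_y = math.sqrt(sum(b * b for b in y))     # ||y|| = sqrt(17)
sim = dot / (norm_x * norm_y)

print(dot, round(norm_x, 2), round(norm_y, 2), round(sim, 2))
# prints: 25 6.48 4.12 0.94
```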
When attributes are binary-valued, the cosine similarity function can be interpreted
in terms of shared features or attributes. Suppose an object x possesses the ith attribute
if $x_i = 1$. Then $x^t \cdot y$ is the number of attributes possessed (i.e., shared) by both x and
y, and $\|x\|\,\|y\|$ is the geometric mean of the number of attributes possessed by x and the
number possessed by y. Thus, sim(x, y) is a measure of relative possession of common
attributes.
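To make the binary interpretation concrete, here is a small Python sketch. The two vectors are hypothetical and chosen only for illustration. Because each entry is 0 or 1, squaring an entry leaves it unchanged, so the squared norm of a vector equals the number of attributes it possesses.

```python
import math

# Hypothetical binary attribute vectors: 1 means the object possesses the attribute.
x = (1, 1, 0, 1, 0, 0)
y = (1, 0, 0, 1, 1, 1)

shared = sum(a * b for a, b in zip(x, y))   # attributes possessed by both x and y
count_x = sum(x)                            # attributes possessed by x (equals ||x||^2)
count_y = sum(y)                            # attributes possessed by y (equals ||y||^2)

# ||x|| ||y|| = sqrt(count_x * count_y), the geometric mean of the two counts.
sim = shared / math.sqrt(count_x * count_y)

print(shared, count_x, count_y, round(sim, 3))
# prints: 2 3 4 0.577
```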
A simple variation of cosine similarity for the preceding scenario is
\[
\mathrm{sim}(x, y) = \frac{x \cdot y}{x \cdot x + y \cdot y - x \cdot y}, \tag{2.24}
\]
which is the ratio of the number of attributes shared by x and y to the number of
attributes possessed by x or y. This function, known as the Tanimoto coefficient or
Tanimoto distance, is frequently used in information retrieval and biological taxonomy.
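Eq. (2.24) can be sketched in Python as follows. The function name `tanimoto` and the error raised when both vectors are all zeros are assumptions made for this illustration. The binary vectors are hypothetical.

```python
def tanimoto(x, y):
    """Tanimoto coefficient of x and y, following Eq. (2.24)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of components")
    dot_xy = sum(a * b for a, b in zip(x, y))
    dot_xx = sum(a * a for a in x)
    dot_yy = sum(b * b for b in y)
    denom = dot_xx + dot_yy - dot_xy
    if denom == 0:
        # Assumption: with both vectors all zeros the ratio is 0/0, so we raise.
        raise ValueError("Tanimoto coefficient is undefined for two zero vectors")
    return dot_xy / denom

# Hypothetical binary vectors: 2 shared attributes, 5 possessed by x or y.
x = (1, 1, 0, 1, 0, 0)
y = (1, 0, 0, 1, 1, 1)
print(tanimoto(x, y))
# prints: 0.4
```

For binary vectors the result matches the set view: the attributes of x are at positions {0, 1, 3}, those of y at {0, 3, 4, 5}, so the intersection has 2 elements and the union has 5.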