2.4 Measuring Data Similarity and Dissimilarity 77

data described by the three attributes of mixed types is:

 
\[
\begin{bmatrix}
0 & & & \\
0.85 & 0 & & \\
0.65 & 0.83 & 0 & \\
0.13 & 0.71 & 0.79 & 0
\end{bmatrix}.
\]

From Table 2.2, we can intuitively guess that objects 1 and 4 are the most similar, based
on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where
d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates
that objects 1 and 2 are the least similar.
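The reading of the matrix can be checked mechanically. The following is a minimal sketch, assuming the matrix is stored as a lower-triangular list of rows with objects numbered from 1; the names D and pairs are illustrative and do not come from the text.

```python
# Lower-triangular dissimilarity matrix from the text (objects numbered 1..4).
D = [
    [0.0],
    [0.85, 0.0],
    [0.65, 0.83, 0.0],
    [0.13, 0.71, 0.79, 0.0],
]


def pairs(matrix):
    """Yield (dissimilarity, i, j) for every pair of different objects (i > j)."""
    for i, row in enumerate(matrix, start=1):
        for j, value in enumerate(row, start=1):
            if i != j:  # skip the zero diagonal d(i, i)
                yield value, i, j


most_similar = min(pairs(D))    # smallest dissimilarity
least_similar = max(pairs(D))   # largest dissimilarity

print(most_similar)   # (0.13, 4, 1)
print(least_similar)  # (0.85, 2, 1)
```

The smallest entry is d(4, 1) = 0.13 and the largest is d(2, 1) = 0.85, which matches the observation that objects 1 and 4 are the most similar and objects 1 and 2 the least.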

2.4.7 Cosine Similarity
A document can be represented by thousands of attributes, each recording the frequency
of a particular word (such as a keyword) or phrase in the document. Thus, each docu-
ment is an object represented by what is called a term-frequency vector. For example, in
Table 2.5, we see that Document1 contains five instances of the word team, while hockey
occurs three times. The word coach is absent from the entire document, as indicated by
a count value of 0. Such data can be highly asymmetric.
Term-frequency vectors are typically very long and sparse (i.e., they have many 0 val-
ues). Applications using such structures include information retrieval, text document
clustering, biological taxonomy, and gene feature mapping. The traditional distance
measures that we have studied in this chapter do not work well for such sparse numeric
data. For example, two term-frequency vectors may have many 0 values in common,
meaning that the corresponding documents do not share many words, but this does not
make them similar. We need a measure that will focus on the words that the two docu-
ments do have in common, and the occurrence frequency of such words. In other words,
we need a measure for numeric data that ignores zero-matches.
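To make the zero-match problem concrete, the sketch below uses two hypothetical term-frequency vectors (a and b are invented for illustration, not taken from Table 2.5) that share no words at all. A naive position-by-position agreement score, used here only as a stand-in for a measure that counts zero-matches, still rates them as largely alike.

```python
# Two hypothetical, mostly-zero term-frequency vectors over a 10-word vocabulary.
a = [4, 0, 0, 0, 0, 0, 0, 0, 0, 0]
b = [0, 0, 0, 0, 0, 0, 0, 0, 0, 6]

# Positions where both counts are zero ("zero-matches").
zero_matches = sum(1 for u, v in zip(a, b) if u == 0 and v == 0)

# Positions where both documents actually use the word.
shared_terms = sum(1 for u, v in zip(a, b) if u > 0 and v > 0)

# Naive agreement: fraction of positions with identical counts.
# Zero-matches are counted as agreement, which inflates the score.
agreement = sum(1 for u, v in zip(a, b) if u == v) / len(a)

print(zero_matches)  # 8
print(shared_terms)  # 0
print(agreement)     # 0.8
```

The two documents have no word in common, yet eight of the ten positions agree simply because both counts are zero. A measure suited to sparse data should be driven by the shared nonzero terms instead.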
Cosine similarity is a measure of similarity that can be used to compare docu-
ments or, say, give a ranking of documents with respect to a given vector of query
words. Let x and y be two vectors for comparison. Using the cosine measure as a

Table 2.5 Document Vector or Term-Frequency Vector

Document    team  coach  hockey  baseball  soccer  penalty  score  win  loss  season
Document1      5      0       3         0       2        0      0    2     0       0
Document2      3      0       2         0       1        1      0    1     0       1
Document3      0      7       0         2       1        0      0    3     0       0
Document4      0      1       0         0       1        2      2    0     3       0
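As a rough illustration of how the vectors in Table 2.5 can be compared, the sketch below applies the standard definition of cosine similarity, the dot product of x and y divided by the product of their Euclidean norms. That formula is assumed here as the usual definition; it is not stated in the passage above. The function name cosine_similarity and the dictionary docs are illustrative.

```python
import math

# Term-frequency vectors from Table 2.5, in column order:
# team, coach, hockey, baseball, soccer, penalty, score, win, loss, season
docs = {
    "Document1": [5, 0, 3, 0, 2, 0, 0, 2, 0, 0],
    "Document2": [3, 0, 2, 0, 1, 1, 0, 1, 0, 1],
    "Document3": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document4": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
}


def cosine_similarity(x, y):
    """Return (x . y) / (||x|| * ||y||) for two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(u * v for u, v in zip(x, y))
    norm_x = math.sqrt(sum(u * u for u in x))
    norm_y = math.sqrt(sum(v * v for v in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


sim_12 = cosine_similarity(docs["Document1"], docs["Document2"])
print(round(sim_12, 2))  # 0.94
```

Only positions where both counts are nonzero contribute to the dot product, so shared zeros add nothing to the numerator. For Document1 and Document2 the dot product is 25 and the norms are sqrt(42) and sqrt(17), giving a similarity of about 0.94.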
