                                                       2.4 Measuring Data Similarity and Dissimilarity  77


                               data described by the three attributes of mixed types is:

                                                                        
                               \[
                               \begin{bmatrix}
                               0    &      &      &   \\
                               0.85 & 0    &      &   \\
                               0.65 & 0.83 & 0    &   \\
                               0.13 & 0.71 & 0.79 & 0
                               \end{bmatrix}.
                               \]
                               From Table 2.2, we can intuitively guess that objects 1 and 4 are the most similar, based
                               on their values for test-1 and test-2. This is confirmed by the dissimilarity matrix, where
                               d(4, 1) is the lowest value for any pair of different objects. Similarly, the matrix indicates
                               that objects 1 and 2 are the least similar.
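The reading of the matrix above can be checked mechanically. The following is a minimal sketch, assuming the lower-triangular matrix is stored as a list of rows; the variable names are illustrative and not from the text.

```python
# Lower-triangular dissimilarity matrix from the text (objects numbered 1..4).
d = [
    [0.0],
    [0.85, 0.0],
    [0.65, 0.83, 0.0],
    [0.13, 0.71, 0.79, 0.0],
]

# Collect (dissimilarity, i, j) for every pair of different objects with i > j.
pairs = [(d[i][j], i + 1, j + 1) for i in range(len(d)) for j in range(i)]

most_similar = min(pairs)    # pair with the smallest dissimilarity
least_similar = max(pairs)   # pair with the largest dissimilarity

print(most_similar)   # (0.13, 4, 1)
print(least_similar)  # (0.85, 2, 1)
```

The smallest entry is d(4, 1) and the largest is d(2, 1), which agrees with the observations made in the text.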


                         2.4.7 Cosine Similarity
                               A document can be represented by thousands of attributes, each recording the frequency
                               of a particular word (such as a keyword) or phrase in the document. Thus, each document
                               is an object represented by what is called a term-frequency vector. For example, in
                               Table 2.5, we see that Document1 contains five instances of the word team, while hockey
                               occurs three times. The word coach is absent from the entire document, as indicated by
                               a count value of 0. Such data can be highly asymmetric.
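As a rough illustration of how a term-frequency vector might be built, here is a minimal sketch. The function name `term_frequency_vector`, the three-term vocabulary, and the sample sentence are all hypothetical. Splitting on whitespace is a simplifying assumption; a real system would need proper tokenization.

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in text (case-insensitive)."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur, giving the sparse zeros.
    return [counts[term] for term in vocabulary]

vocabulary = ["team", "coach", "hockey"]          # hypothetical vocabulary
text = "Team wins hockey game as team rallies"    # made-up sample document
vec = term_frequency_vector(text, vocabulary)
print(vec)  # [2, 0, 1]
```

With a vocabulary of thousands of terms, most entries of such a vector would be 0 for any single document.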
                                 Term-frequency vectors are typically very long and sparse (i.e., they have many 0 values).
                               Applications using such structures include information retrieval, text document
                               clustering, biological taxonomy, and gene feature mapping. The traditional distance
                               measures that we have studied in this chapter do not work well for such sparse numeric
                               data. For example, two term-frequency vectors may have many 0 values in common,
                               meaning that the corresponding documents do not share many words, but this does not
                               make them similar. We need a measure that will focus on the words that the two documents
                               do have in common, and the occurrence frequency of such words. In other words,
                               we need a measure for numeric data that ignores zero-matches.
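To see why zero-matches are misleading, consider the following minimal sketch. It assumes a hypothetical vocabulary of 1000 terms and two made-up documents that share no words at all, and it uses the fraction of agreeing positions (both zero or both nonzero) purely for illustration.

```python
n_terms = 1000  # hypothetical vocabulary size

# Two made-up sparse documents with five distinct terms each and no overlap.
doc_a = [0] * n_terms
doc_b = [0] * n_terms
for i in range(0, 5):
    doc_a[i] = 1
for i in range(5, 10):
    doc_b[i] = 1

# Positions where both vectors agree on presence or absence of the term.
agree = sum((a > 0) == (b > 0) for a, b in zip(doc_a, doc_b))
# Terms that actually occur in both documents.
shared_terms = sum(a > 0 and b > 0 for a, b in zip(doc_a, doc_b))

print(agree / n_terms)  # 0.99
print(shared_terms)     # 0
```

The two documents agree on 99% of positions only because both omit the same 990 terms, even though they have no word in common.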
                                 Cosine similarity is a measure of similarity that can be used to compare documents
                               or, say, give a ranking of documents with respect to a given vector of query
                               words. Let x and y be two vectors for comparison. Using the cosine measure as a

                     Table 2.5 Document Vector or Term-Frequency Vector

                               Document   team  coach  hockey  baseball  soccer  penalty  score  win  loss  season
                               Document1     5      0       3         0       2        0      0    2     0       0
                               Document2     3      0       2         0       1        1      0    1     0       1
                               Document3     0      7       0         2       1        0      0    3     0       0
                               Document4     0      1       0         0       1        2      2    0     3       0
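The vectors in Table 2.5 can be compared with a short sketch. It assumes the standard definition of cosine similarity, sim(x, y) = (x . y) / (||x|| ||y||), where ||x|| is the Euclidean norm. The function name `cosine_similarity` and the dictionary layout are illustrative choices, not taken from the text.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (assumed standard definition)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_x * norm_y)

# Term-frequency vectors from Table 2.5 (team, coach, hockey, baseball,
# soccer, penalty, score, win, loss, season).
documents = {
    "Document1": [5, 0, 3, 0, 2, 0, 0, 2, 0, 0],
    "Document2": [3, 0, 2, 0, 1, 1, 0, 1, 0, 1],
    "Document3": [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    "Document4": [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
}

sim_12 = cosine_similarity(documents["Document1"], documents["Document2"])
sim_13 = cosine_similarity(documents["Document1"], documents["Document3"])
print(round(sim_12, 2))  # 0.94
```

Terms that are 0 in both vectors contribute nothing to the dot product or to either norm, so shared zeros do not affect the result. Under this definition, Document1 is far closer to Document2 than to Document3.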