
Chapter 2 Getting to Know Your Data



                         similarity function, we have

                         $$\mathrm{sim}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}, \tag{2.23}$$

                         where $\|x\|$ is the Euclidean norm of vector $x = (x_1, x_2, \ldots, x_p)$, defined as
                         $\sqrt{x_1^2 + x_2^2 + \cdots + x_p^2}$. Conceptually, it is the length of the vector. Similarly, $\|y\|$ is the
                         Euclidean norm of vector y. The measure computes the cosine of the angle between
                         vectors x and y. A cosine value of 0 means that the two vectors are at 90 degrees to each
                         other (orthogonal) and have no match. The closer the cosine value to 1, the smaller the
                         angle and the greater the match between vectors. Note that because the cosine similarity
                         measure does not obey all of the properties of Section 2.4.4 defining metric measures, it
                         is referred to as a nonmetric measure.
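
The two extremes described above can be checked with a short sketch. The function name `cosine_similarity` and the sample vectors are illustrative assumptions, not part of the text; the body simply follows Eq. (2.23) using only the Python standard library.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors, per Eq. (2.23)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))        # x . y
    norm_x = math.sqrt(sum(a * a for a in x))     # ||x||
    norm_y = math.sqrt(sum(b * b for b in y))     # ||y||
    if norm_x == 0 or norm_y == 0:
        # The ratio is undefined when either vector has zero length.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Hypothetical vectors chosen to hit the two extremes.
orthogonal = cosine_similarity([1, 0], [0, 3])   # vectors at 90 degrees
parallel = cosine_similarity([1, 2], [2, 4])     # same direction, different length
```

Here `orthogonal` is 0, and `parallel` is 1 up to floating-point rounding. The second case shows that the measure depends only on direction: two different vectors can still be a perfect match.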
           Example 2.23 Cosine similarity between two term-frequency vectors. Suppose that x and y are the
                         first two term-frequency vectors in Table 2.5. That is, x = (5,0,3,0,2,0,0,2,0,0) and
                         y = (3,0,2,0,1,1,0,1,0,1). How similar are x and y? Using Eq. (2.23) to compute the
                         cosine similarity between the two vectors, we get:

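                         $$
                         \begin{aligned}
                         x^t \cdot y &= 5 \times 3 + 0 \times 0 + 3 \times 2 + 0 \times 0 + 2 \times 1 + 0 \times 1 + 0 \times 0 + 2 \times 1 \\
                                     &\quad + 0 \times 0 + 0 \times 1 = 25 \\
                         \|x\| &= \sqrt{5^2 + 0^2 + 3^2 + 0^2 + 2^2 + 0^2 + 0^2 + 2^2 + 0^2 + 0^2} = 6.48 \\
                         \|y\| &= \sqrt{3^2 + 0^2 + 2^2 + 0^2 + 1^2 + 1^2 + 0^2 + 1^2 + 0^2 + 1^2} = 4.12 \\
                         \mathrm{sim}(x, y) &= 0.94
                         \end{aligned}
                         $$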
                         Therefore, if we were using the cosine similarity measure to compare these documents,
                         they would be considered quite similar.
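
The arithmetic of Example 2.23 can be reproduced in a few lines. This is a minimal sketch using only the standard library; the variable names are illustrative, and the vectors are the ones given in the example.

```python
import math

# The two term-frequency vectors from Example 2.23.
x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

dot = sum(a * b for a, b in zip(x, y))       # inner product of x and y
norm_x = math.sqrt(sum(a * a for a in x))    # Euclidean norm of x
norm_y = math.sqrt(sum(b * b for b in y))    # Euclidean norm of y
sim = dot / (norm_x * norm_y)                # Eq. (2.23)

print(dot, round(norm_x, 2), round(norm_y, 2), round(sim, 2))
# prints: 25 6.48 4.12 0.94
```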
                           When attributes are binary-valued, the cosine similarity function can be interpreted
                         in terms of shared features or attributes. Suppose an object x possesses the ith attribute
                         if $x_i = 1$. Then $x^t \cdot y$ is the number of attributes possessed (i.e., shared) by both x and
                         y, and $\|x\|\,\|y\|$ is the geometric mean of the number of attributes possessed by x and the
                         number possessed by y. Thus, sim(x, y) is a measure of relative possession of common
                         attributes.
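
As a small illustration of the binary case, the sketch below uses two hypothetical 0/1 vectors (not taken from the text) and computes the similarity from attribute counts alone. For 0/1 entries each squared norm equals the number of 1s, so the product of the norms is the geometric mean of the two counts.

```python
import math

# Hypothetical binary attribute vectors: 1 means the object has the attribute.
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]

shared = sum(a * b for a, b in zip(x, y))   # attributes possessed by both x and y
count_x = sum(x)                            # attributes possessed by x
count_y = sum(y)                            # attributes possessed by y

# Geometric mean of the two counts; equals ||x|| * ||y|| for 0/1 vectors.
geometric_mean = math.sqrt(count_x * count_y)
sim = shared / geometric_mean
```

With these vectors, two attributes are shared and each object has three, so the similarity is 2/3.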
                           A simple variation of cosine similarity for the preceding scenario is
                         $$\mathrm{sim}(x, y) = \frac{x \cdot y}{x \cdot x + y \cdot y - x \cdot y}, \tag{2.24}$$
                         which is the ratio of the number of attributes shared by x and y to the number of
                         attributes possessed by x or y. This function, known as the Tanimoto coefficient or
                         Tanimoto distance, is frequently used in information retrieval and biological taxonomy.
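
A minimal sketch of Eq. (2.24) follows. The function name `tanimoto` and the sample vectors are assumptions made for illustration; for binary vectors the result should agree with the ratio of shared attributes to attributes possessed by either object.

```python
def tanimoto(x, y):
    """Tanimoto coefficient of two equal-length numeric vectors, per Eq. (2.24)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot_xy = sum(a * b for a, b in zip(x, y))
    dot_xx = sum(a * a for a in x)
    dot_yy = sum(b * b for b in y)
    denominator = dot_xx + dot_yy - dot_xy
    if denominator == 0:
        # Happens only when both vectors are all zeros.
        raise ValueError("Tanimoto coefficient is undefined for two zero vectors")
    return dot_xy / denominator

# Hypothetical binary attribute vectors.
x = [1, 1, 0, 1, 0]
y = [1, 0, 0, 1, 1]

coefficient = tanimoto(x, y)

# Cross-check for the binary case: shared attributes over attributes in x or y.
shared = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
```

Here two attributes are shared and four are possessed by x or y, so both `coefficient` and `shared / either` equal 0.5.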