


          72    Chapter 2 Getting to Know Your Data



                         (patients) is computed based only on the asymmetric attributes. According to Eq. (2.14),
                         the distance between each pair of the three patients—Jack, Mary, and Jim—is
                         \[ d(\text{Jack}, \text{Jim}) = \frac{1 + 1}{1 + 1 + 1} = 0.67, \]
                         \[ d(\text{Jack}, \text{Mary}) = \frac{0 + 1}{2 + 0 + 1} = 0.33, \]
                         \[ d(\text{Jim}, \text{Mary}) = \frac{1 + 2}{1 + 1 + 2} = 0.75. \]
                         These measurements suggest that Jim and Mary are unlikely to have a similar disease
                         because they have the highest dissimilarity value among the three pairs. Of the three
                         patients, Jack and Mary are the most likely to have a similar disease.
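
The arithmetic above can be reproduced with a short Python sketch. It assumes the usual reading of Eq. (2.14), d(i, j) = (r + s) / (q + r + s), where q counts attributes that are 1 for both objects and r and s count attributes that are 1 for only one of them. The function name and the `pairs` dictionary are illustrative, and the counts are read directly off the three fractions shown above.

```python
def asymmetric_binary_dissimilarity(q: int, r: int, s: int) -> float:
    """Return (r + s) / (q + r + s); negative matches are ignored.

    q: attributes equal to 1 for both objects
    r: attributes equal to 1 for the first object only
    s: attributes equal to 1 for the second object only
    """
    denominator = q + r + s
    if denominator == 0:
        raise ValueError("no attribute is 1 for either object")
    return (r + s) / denominator


# (q, r, s) counts taken from the numerators and denominators in the text.
pairs = {
    ("Jack", "Jim"): (1, 1, 1),
    ("Jack", "Mary"): (2, 0, 1),
    ("Jim", "Mary"): (1, 1, 2),
}

dissim = {pair: asymmetric_binary_dissimilarity(*counts)
          for pair, counts in pairs.items()}

for pair, d in dissim.items():
    print(pair, round(d, 2))
```

Sorting `dissim` by value gives the same ranking as the text: Jack and Mary are the closest pair, Jim and Mary the farthest.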


                   2.4.4 Dissimilarity of Numeric Data: Minkowski Distance

                         In this section, we describe distance measures that are commonly used for computing
                         the dissimilarity of objects described by numeric attributes. These measures include the
                         Euclidean, Manhattan, and Minkowski distances.
                           In some cases, the data are normalized before applying distance calculations. This
                         involves transforming the data to fall within a smaller or common range, such as [−1,1]
                         or [0.0, 1.0]. Consider a height attribute, for example, which could be measured in either
                         meters or inches. In general, expressing an attribute in smaller units will lead to a larger
                         range for that attribute, and thus tend to give such attributes greater effect or “weight.”
                         Normalizing the data attempts to give all attributes an equal weight. It may or may not be
                         useful in a particular application. Methods for normalizing data are discussed in detail
                         in Chapter 3 on data preprocessing.
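
As a rough illustration of the unit effect, the sketch below rescales a height attribute to [0.0, 1.0]. It assumes min-max scaling, which is only one possible normalization method and is not prescribed by this section. The height values and the function name are made up for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Hypothetical heights, first in meters and then converted to inches.
heights_m = [1.50, 1.65, 1.80, 1.95]
heights_in = [h / 0.0254 for h in heights_m]

# The smaller unit (inches) yields the larger raw range.
range_m = max(heights_m) - min(heights_m)
range_in = max(heights_in) - min(heights_in)

# After normalization the choice of unit no longer matters.
norm_m = min_max_normalize(heights_m)
norm_in = min_max_normalize(heights_in)

print(range_m, range_in)
print(norm_m)
print(norm_in)
```

The raw range in inches is about 39 times the range in meters, yet both normalized lists agree up to floating-point rounding, so the attribute contributes equally to a distance regardless of the unit it was recorded in.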
                           The most popular distance measure is Euclidean distance (i.e., straight line or
                         “as the crow flies”). Let $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ be two objects
                         described by p numeric attributes. The Euclidean distance between objects i and j is
                         defined as

                         \[ d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}. \tag{2.16} \]
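
Eq. (2.16) translates almost directly into code. The following is a minimal sketch in plain Python. The function name and the two sample points are hypothetical and chosen only for illustration.

```python
import math


def euclidean_distance(i, j):
    """Straight-line distance between two objects with p numeric attributes."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of attributes")
    # Sum the squared per-attribute differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))


# Hypothetical objects with p = 2 attributes.
x1 = (1, 2)
x2 = (3, 5)

d = euclidean_distance(x1, x2)  # sqrt(2**2 + 3**2) = sqrt(13)
print(d)
```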
                         Another well-known measure is the Manhattan (or city block) distance, named so
                         because it is the distance in blocks between any two points in a city (such as 2 blocks
                         down and 3 blocks over for a total of 5 blocks). It is defined as
                         \[ d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|. \tag{2.17} \]
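
The city-block example from the text (2 blocks down and 3 blocks over) can be checked with a small sketch of Eq. (2.17). The function name and the coordinate convention (first coordinate is blocks over, second is blocks up) are assumptions made for the example.

```python
def manhattan_distance(i, j):
    """Sum of absolute per-attribute differences between two objects."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(i, j))


# Start at the origin, end 3 blocks over and 2 blocks down.
start = (0, 0)
end = (3, -2)

blocks = manhattan_distance(start, end)
print(blocks)  # 3 + 2 = 5 blocks
```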
                         Both the Euclidean and the Manhattan distance satisfy the following mathematical
                         properties:
                         Non-negativity: d(i, j) ≥ 0: Distance is a non-negative number.
                         Identity of indiscernibles: d(i, i) = 0: The distance of an object to itself is 0.