
Chapter 8 ■ Classification  301


                               entries, tomatoes both, and the area and the green component are used in a
                               classification, then the data points are:
                                                             P = (1634, 46)
                                                            Q = (1384, 53)
                                 The Euclidean distance between these two is:
                                   \[ \sqrt{(1634 - 1384)^2 + (46 - 53)^2} = \sqrt{62500 + 49} = 250.1 \]
                                 Now change the green component of P by 1 to (1634, 45). The distance
                               between P and Q is now 250.13. Changing the area component by 1 so that
                               P = (1635, 46) changes the P − Q distance to 251.1. This shows that a change
                               in the first coordinate makes a bigger difference in the distance than does a
                               change in the second. Or in other words, the scales of the two coordinate axes
                               are different. This is very common in computer vision problems, and it really
                               does make sense. Why would we expect that each of the measurements would
                               have units of the same size?
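The arithmetic above can be checked with a minimal sketch. The helper name `euclidean` and the variable names are illustrative assumptions, not taken from the text; only the coordinates of P and Q come from the example.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Feature vectors are (area, green), as in the example above.
Q = (1384, 53)
base = euclidean((1634, 46), Q)         # original P
green_shift = euclidean((1634, 45), Q)  # green component changed by 1
area_shift = euclidean((1635, 46), Q)   # area component changed by 1

# A unit change in area moves the distance by about 1.0, while a unit
# change in green moves it by only about 0.03.
print(f"{base:.2f} {green_shift:.2f} {area_shift:.2f}")
```

Because the area difference (250) dominates the green difference (7), the distance is almost entirely determined by the area axis.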
                                 Normalizing with respect to scale can be done using statistics. The standard
                               deviation is a measure of variability: it describes how widely the values are
                               spread about their mean. Dividing sample values by the standard deviation
                               narrows the range of values and expresses each measurement in units of its
                               own spread, so that the axes become comparable. This is the basic idea behind
                               Mahalanobis distance. For example, consider the same points P and Q as before
                               and the normalized points P′ and Q′. The overall standard deviations are:
                                                      s_area = 429.5    s_green = 25.2

                                 The points are:
                                        P = (1634, 46)    Q = (1384, 53)    distance(P, Q) = 250.1
                                        P′ = (3.8, 1.83)  Q′ = (3.2, 2.1)   distance(P′, Q′) = 0.64
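As a rough check of the normalized values, the following sketch divides each coordinate by the quoted standard deviation and then measures the ordinary Euclidean distance. The function and variable names are assumptions made for illustration; the standard deviations and points are the ones given above.

```python
import math

S_AREA, S_GREEN = 429.5, 25.2  # standard deviations quoted in the text

def normalize(point, scales):
    """Divide each coordinate by the standard deviation of its feature."""
    return tuple(v / s for v, s in zip(point, scales))

def euclidean(p, q):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

P, Q = (1634, 46), (1384, 53)
P_n = normalize(P, (S_AREA, S_GREEN))  # roughly (3.80, 1.83)
Q_n = normalize(Q, (S_AREA, S_GREEN))  # roughly (3.22, 2.10)
d_norm = euclidean(P_n, Q_n)           # roughly 0.64

print(P_n, Q_n, d_norm)
```

After scaling, the two features contribute on comparable terms: the area difference is about 0.58 standard deviations and the green difference about 0.28.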
                                 The standard deviations are used to normalize the raw sample values before
                               computing distance. It’s actually more complex than that; reality tends to
                               make the math harder. The formula for computing the Mahalanobis distance
                               between P and Q is:
                                   \[ d_M(P, Q) = \sqrt{(P - Q)^T S^{-1} (P - Q)} \]    (EQ 8.5)
                               which is a matrix equation, in which P and Q are the points (vectors) for which
                               the distance is being computed, (P − Q)^T is the transpose of the difference of
                               the vectors, and S is the covariance matrix.
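The matrix form can be sketched for the two-feature case as follows. The function name `mahalanobis_2d` is hypothetical, and the covariance matrix used here is an assumption: the text gives only the two standard deviations, so the off-diagonal covariance is set to zero purely for illustration. Under that assumption the result should match the distance between the normalized points computed earlier.

```python
import math

def mahalanobis_2d(p, q, cov):
    """sqrt((p - q)^T S^-1 (p - q)) for 2-D points and a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix: (1 / det) * [[d, -b], [-c, a]]
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = p[0] - q[0], p[1] - q[1]
    # Quadratic form (dx, dy) * inv * (dx, dy)^T
    quad = (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(quad)

# ASSUMPTION: area and green are uncorrelated, so S is diagonal with the
# squared standard deviations (variances) on the diagonal.
S = ((429.5 ** 2, 0.0),
     (0.0, 25.2 ** 2))

d_m = mahalanobis_2d((1634, 46), (1384, 53), S)  # roughly 0.64
print(d_m)
```

With an identity matrix in place of S the function reduces to the plain Euclidean distance; a non-zero off-diagonal term would additionally account for correlation between the two features.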
                                 The variance is the mean of the squared distances between a value and the
                               mean of those values:
                                   \[ \mathrm{VAR} = \frac{\sum_{i=1}^{n} (P_i - \mu_i)(P_i - \mu_i)}{n - 1} \]    (EQ 8.6)
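A small sketch of this computation for a single feature is shown below. The sample values are hypothetical (they are not data from the text), and the code assumes one mean `mu` for the whole sample with the n − 1 divisor of the formula. Python's standard `statistics.variance` uses the same divisor and serves as a cross-check.

```python
import statistics

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mu = sum(values) / n
    return sum((v - mu) * (v - mu) for v in values) / (n - 1)

# HYPOTHETICAL green-component measurements, for illustration only.
greens = [46, 53, 40, 61]

var_green = sample_variance(greens)
print(var_green, statistics.variance(greens))
```

The square root of this value is the standard deviation used for normalization.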