When two (or more) variables are involved, this calculation can include combinations of the variables. In the case of P and Q:

    COV = \frac{\sum_{i=1}^{n} (P_i - \mu_P)(Q_i - \mu_Q)}{n - 1}        (EQ 8.7)

So, covariance is a generalization of variance to multiple variables. The Mahalanobis distance is much more computationally expensive than the other distance measures, but it has the important advantage of being scale independent, and so it is often used. For simplicity, however, many people use the Euclidean distance, and without loss of generality most of the remaining examples will use it. Any distance measure may be substituted, of course.
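As a concrete illustration, the short sketch below computes the sample covariance of EQ 8.7 and the Euclidean distance between two feature vectors. The function and variable names are chosen here for illustration only and do not come from the text.

import numpy as np

def covariance(p, q):
    """Sample covariance of two equal-length samples P and Q (EQ 8.7)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    n = len(p)
    return np.sum((p - p.mean()) * (q - q.mean())) / (n - 1)

def euclidean(v, w):
    """Euclidean distance between two feature vectors."""
    v = np.asarray(v, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sqrt(np.sum((v - w) ** 2))

print(covariance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))           # 2.0
print(euclidean([5.1, 3.5, 1.4, 0.2], [6.3, 3.3, 6.0, 2.5]))  # about 5.28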


                           8.2.2    Distances Between Features
Many pattern recognition tasks use a large number of features to distinguish between many classes. The Iris data set characterizes its three classes with four features, which is already too many to visualize in a straightforward way. This data set will be used to illustrate distance-based classifiers, starting with the nearest neighbor classifier.
Given N classes C_1, C_2, ..., C_N and M features F_1 .. F_M, consider the classification of an object, P. Measure all features for this object and create an M-dimensional vector, v, from them. Feature vectors have also been created for all objects in all N classes; the first such vector in class C_1 will be C_1^1, the eighth one in class 3 will be C_3^8, and so on. Classification of P by the nearest neighbor method involves calculating the distances between v and all of these feature vectors for all the classes. The class of the feature vector having the minimum distance from v is assigned to P.
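The decision rule itself can be sketched in a few lines. In the sketch below, the labelled feature vectors are assumed to be held in a mapping from class label to a list of M-dimensional vectors; that arrangement, and the function name, are illustrative only and not taken from the text.

import numpy as np

def nearest_neighbor_class(v, labelled_vectors):
    """Return the label of the stored feature vector closest to v (Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label, vectors in labelled_vectors.items():
        for w in vectors:
            d = np.sqrt(np.sum((np.asarray(v) - np.asarray(w)) ** 2))
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label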
The name of the method is very descriptive: the class of an unknown target will be the same as that of its nearest neighbor in feature space. Let's see how this works using the Iris data set. First, the set needs to be broken into training data and test data: the first half of the data for each class is selected as training data, and the last half as test data.
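As a minimal sketch of this split, assuming the Iris data are loaded through scikit-learn (a convenience assumed here, not something the text prescribes), the first 25 samples of each 50-sample class become training data and the last 25 become test data:

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target              # 150 samples, 4 features, 3 classes

train_X, train_y, test_X, test_y = [], [], [], []
for c in np.unique(y):
    idx = np.where(y == c)[0]              # the 50 samples of class c
    half = len(idx) // 2
    train_X.append(X[idx[:half]])
    train_y.append(y[idx[:half]])
    test_X.append(X[idx[half:]])
    test_y.append(y[idx[half:]])

train_X = np.vstack(train_X)
train_y = np.concatenate(train_y)
test_X = np.vstack(test_X)
test_y = np.concatenate(test_y)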
Next, a feature vector is created for each data item; there are four features, so each vector has four components. Each test vector is compared against (i.e., the distance is computed to) all of the training data vectors, and the class of the one with the smallest distance is saved: this is the class given to the target. This is done for each of the test data items, and success rates are computed; the raw success rate, the number of correct classifications divided by the number of test data items, is a good indicator of how good the features are and of how well the classifier will work overall.
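Continuing from the split sketched above, the classification loop and the raw success rate might look like the following; this is only a sketch, with the nearest-neighbor rule rewritten in vectorized form and the helper name chosen for illustration.

import numpy as np

def classify_nearest(v, train_X, train_y):
    """Class of the training vector with the smallest Euclidean distance to v."""
    dists = np.sqrt(np.sum((train_X - v) ** 2, axis=1))
    return train_y[np.argmin(dists)]

correct = sum(
    classify_nearest(v, train_X, train_y) == true_class
    for v, true_class in zip(test_X, test_y)
)
raw_success_rate = correct / len(test_y)   # correct classifications / test items
print(f"Raw success rate: {raw_success_rate:.3f}")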