


not to choose the same item more than once. All the items not selected form the training set. In principle this can be repeated arbitrarily many times, but nothing is gained by doing so; between 5 and 10 trials would be sufficient for the Iris data set. Using random cross validation with the classes kept balanced, 10 examples from each class in the test set, and the overall success rate averaged over ten trials, a 93% success rate was obtained. The result would be a little different each time because of the random nature of the experiment.
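  As a concrete illustration, the following sketch runs a balanced random cross validation of this kind on the Iris data with a nearest neighbor classifier. It assumes the scikit-learn library; the library, the classifier choice, and the specific calls are illustrative assumptions rather than the exact experiment that produced the 93% figure.

    # A minimal sketch of balanced random cross validation on the Iris data.
    # Assumes scikit-learn; the nearest neighbor classifier is an assumption.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import StratifiedShuffleSplit
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Ten random trials; 30 test samples per trial, stratified so that each
    # of the three classes contributes 10 examples to the test set.
    splitter = StratifiedShuffleSplit(n_splits=10, test_size=30, random_state=0)

    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        clf = KNeighborsClassifier(n_neighbors=1)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))

    # The average over the ten trials estimates the overall success rate;
    # the exact value varies a little from run to run because the splits
    # are chosen at random.
    print("Mean success rate: %.3f" % (sum(scores) / len(scores)))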
                             What might be called the ultimate in cross validation picks a single sample
                           from the entire set as test data, and uses the rest as training data. This can be
                           repeated for each of the samples in the set, and the average over all trials gives
                           the success rate. For the Iris data, there would be 150 trials, each with a single
                           classification. This is called leave-one-out cross validation, for obvious reasons.
                             For the Iris set again, leave-one-out cross validation leads to an overall
                           success rate of 96% when used with a nearest neighbor classifier; it’s probably
                           the best that can be done. This is a good technique for use with smaller data
                           sets, but is really too expensive for large ones.
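  A leave-one-out version of the same experiment could be sketched as follows, again assuming scikit-learn; the calls shown are one possible way to run the 150 trials, not the exact code behind the 96% figure quoted above.

    # Leave-one-out cross validation of a nearest neighbor classifier on Iris.
    # Assumes scikit-learn is available.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # 150 trials, each training on 149 samples and classifying the one left out.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y,
                             cv=LeaveOneOut())
    print("Leave-one-out success rate: %.3f" % scores.mean())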


                           8.4    Support Vector Machines


                           Section 8.1.3 discussed the concept of a linear discriminant. This is a straight
                           line that divides the feature values into two groups, one for each class, and is
                           an effective way to implement a classifier if such a line can be found. In higher
                           dimensional spaces — that is, if more than two features are involved — this
                           line becomes a plane or a hyperplane. It’s still linear, just complicated by
                           dimensionality. Samples that lie on one side of the plane belong to one class,
                           while those on the other belong to a different class. A support vector machine
                           (SVM) is a nitro-powered version of such a linear discriminant.
There are a couple of ways in which an SVM differs from simpler linear classifiers. One is that an SVM attempts to optimize the line or plane so that it is the best one that can be found. In the situation illustrated in Figure 8.10 there are two classes, white and black. Any of the lines shown in Figure 8.10a will work to classify the data, at least the data that is seen there. New
                           data could change the situation, of course. Because of that it would be good
                           to select the line that does the best possible job of dividing the plane into the
                           two areas occupied by the two classes. Such a line is shown in Figure 8.10b.
                           The heavy dark line is the best line, and the thin lines on each side of it show
                           the space between the two classes — the heavy line divides this space evenly
                           into two parts, giving a maximum margin or distance between the groups. The
                           point of an SVM is to find the maximum margin hyperplane. A line divides
                           two-dimensional data into two parts; a plane divides three-dimensional data
                           into two parts; and a hyperplane is a linear function that divides N-dimensional
data into two parts. The maximum margin hyperplane is always as far from both classes as possible.
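  To make the idea concrete, the short sketch below fits a linear SVM to two of the Iris classes and reports the resulting hyperplane and support vectors. It assumes scikit-learn; the restriction to two classes and two features, and the large value of C used to approximate a hard margin, are illustrative choices rather than anything prescribed here.

    # A minimal sketch of fitting a maximum margin (linear) SVM.
    # Assumes scikit-learn; the feature and class selection is illustrative.
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    # Keep only two classes and the first two features, so the separating
    # boundary is a line in the plane, as in Figure 8.10.
    mask = y < 2
    X2, y2 = X[mask, :2], y[mask]

    svm = SVC(kernel="linear", C=1000.0)  # large C approximates a hard margin
    svm.fit(X2, y2)

    # The fitted line is w.x + b = 0; the support vectors are the samples
    # that sit on the margin boundaries and determine where the line goes.
    print("w =", svm.coef_[0], " b =", svm.intercept_[0])
    print("support vectors:\n", svm.support_vectors_)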