                               on the training data should be 100%, because each of the points will be a
                               distance of zero from at least one other — itself. Still, the selection of training
                               versus test data is arbitrary, and the two data sets could be exchanged without
                               distorting the results. If this is done for the Iris data using the nearest neighbor
                               classifier, the results become as follows:


                                                      SETOSA            VERSICOLOR            VIRGINICA
                                SETOSA                  25                   0                    0
                                VERSICOLOR               0                  23                    2
                                VIRGINICA                0                   2                   23


                                 The success rate is the same as before, but the details of the confusion matrix
                               are different. Repeating the classification with the roles of the testing and
                               training sets reversed gives us two different trials, though, and should give
                               us more confidence, especially since there is relatively little data here. This
                               process could be described as a 2-way (or 2-fold) cross validation.
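                                  Before moving to cross validation in general, a minimal sketch of a nearest
                               neighbor classifier in C may help make the mechanism concrete. This is not the
                               book's program, and the names used (classify_1nn, dist2, train, label) are
                               invented for illustration, but it shows why the success rate on the training
                               data is always 100%: a training sample's nearest neighbor is itself, at a
                               distance of zero.

/* A minimal 1-nearest-neighbor sketch (hypothetical names, not the book's
   code).  A sample is assigned the class of the closest training sample,
   measured by squared Euclidean distance over the four Iris features.      */

#define NFEATURES 4

static double dist2(const double a[NFEATURES], const double b[NFEATURES])
{
    double d = 0.0;
    for (int i = 0; i < NFEATURES; i++)
        d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;                /* the squared distance suffices for comparison */
}

/* Return the label of the training sample nearest to x.  If x is itself one
   of the training samples, its distance to itself is zero, so it can never
   be misclassified -- hence 100% success on the training data.             */
int classify_1nn(const double x[NFEATURES], int ntrain,
                 const double train[][NFEATURES], const int label[])
{
    int best = 0;
    double bestd = dist2(x, train[0]);

    for (int i = 1; i < ntrain; i++) {
        double d = dist2(x, train[i]);
        if (d < bestd) { bestd = d; best = i; }
    }
    return label[best];
}

                               Classifying a test set is then just a loop that calls classify_1nn on each
                               test sample and compares the result against the true class, tallying the
                               confusion matrix as it goes.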
                                  In general terms, cross validation is a process for repeatedly partitioning
                               data into distinct training and testing sets. There are many ways to do this,
                               some of them wrong. Any partition that places the same samples in both sets
                               would normally be an error, for example, and creating new data points based
                               on statistical samples may in some instances be fine, but it is not cross
                               validation. Cross validation takes the data that exists and partitions it into
                               training/testing sets multiple times so that the sets are different each time.
                                 An n-way cross validation breaks the data into n more-or-less equal parts. Then
                               each of these in turn is used as test data, while all the other parts together are
                               used as training data. This gives n results, and the overall result is the average
                               of those n. The Iris data set has 150 samples in all, so a 5-way cross validation
                               would provide a convenient partitioning into 5 groups of 30 points each. There
                               is no rule that says therehavetobeexactly thesamenumberofsamples in
                               each set, although there should be as many examples of each class as possible.
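                                  A sketch of the loop that such a partitioning implies is shown below. It is
                               a hypothetical driver rather than the book's code: classify_fold is a stand-in
                               for whatever classifier is being evaluated, and if the data file is ordered by
                               class the samples would first need to be shuffled or interleaved so that every
                               fold contains examples of each class, as noted above.

/* Sketch of an n-way cross validation driver (hypothetical, not cross5.c).
   Each fold [lo, hi) is used once as test data; the remaining samples form
   the training set.  The per-fold success rates are averaged at the end.   */
#include <stdio.h>

#define NSAMPLES 150
#define NFOLDS   5

/* Placeholder: a real version would train on the samples outside [lo, hi)
   and test on those inside it, for example with the nearest neighbor
   classifier, returning the fraction of test samples classified correctly. */
static double classify_fold(int lo, int hi)
{
    (void)lo; (void)hi;
    return 1.0;                    /* pretend every test sample was correct */
}

int main(void)
{
    double total = 0.0;

    for (int k = 0; k < NFOLDS; k++) {
        int lo = k * NSAMPLES / NFOLDS;        /* first test index of fold k   */
        int hi = (k + 1) * NSAMPLES / NFOLDS;  /* one past the last test index */
        double rate = classify_fold(lo, hi);

        printf("Partition %d: %5.1f%%\n", k + 1, 100.0 * rate);
        total += rate;
    }
    printf("Average: %5.1f%%\n", 100.0 * total / NFOLDS);
    return 0;
}

                               With 150 samples and 5 folds, lo and hi simply step through 0-30, 30-60, and
                               so on, giving the 5 groups of 30 points described above.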
                                 The program cross5.c works the same way as the nearest neighbor program,
                               except that it reads all the Iris data into one large array at the beginning and
                               then partitions it before each experiment. The result is five distinct experiments
                               with five confusion matrices and success rates:


                                          PARTITION 1 PARTITION 2 PARTITION 3 PARTITION 4 PARTITION 5
                                 SUCCESS (%)  96.7        96.7        93.3        93.3        100.0

                                  This yields an average of (96.7 + 96.7 + 93.3 + 93.3 + 100.0)/5 = 96%.
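                                  The "one large array" step might be sketched as follows. The line format
                               assumed here (four comma-separated measurements followed by an integer class
                               code) is only an illustration, not necessarily the format cross5.c reads.

/* Load the whole Iris data set into one array before any partitioning.
   Assumed line format:  sepal_len,sepal_wid,petal_len,petal_wid,class      */
#include <stdio.h>

#define NSAMPLES  150
#define NFEATURES 4

double data[NSAMPLES][NFEATURES];
int    label[NSAMPLES];

int load_iris(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL) return -1;

    int n = 0;
    while (n < NSAMPLES &&
           fscanf(f, "%lf,%lf,%lf,%lf,%d",
                  &data[n][0], &data[n][1], &data[n][2],
                  &data[n][3], &label[n]) == 5)
        n++;

    fclose(f);
    return n;                        /* number of samples actually read */
}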
                                 Cross validation can be done using random samples of the data, too. A test
                               set would be built from random selections of the full data set, making sure