There is another, better way to evaluate the results. A confusion matrix is a
table in which each column represents an actual class, and each row represents
a class delivered by the classifier (an outcome). For the Iris data experiment,
the confusion matrix is:

                 SETOSA    VERSICOLOR    VIRGINICA
 SETOSA            25           0            0
 VERSICOLOR         0          24            3
 VIRGINICA          0           1           22


The columns add up to the number of elements in each class, and the rows
add up to the number of classifications that the classifier made to each class.
The trace (the sum of the elements along the diagonal) is the number of correct
classifications, and the success rate is the trace divided by the total number
of trials. In this instance the trace is 25 + 24 + 22 = 71 out of 75 trials,
a success rate of nearly 95%, which is pretty good.
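
As an illustration (this is not one of the book's example programs), the trace
and success rate can be computed directly from the matrix. The array below
simply holds the Iris result shown above:

    #include <stdio.h>

    #define NCLASSES 3

    int main (void)
    {
        /* The Iris confusion matrix from the table above.      */
        /* Rows are classifier outcomes; columns are actual     */
        /* classes (setosa, versicolor, virginica).             */
        int cm[NCLASSES][NCLASSES] = {
            {25,  0,  0},
            { 0, 24,  3},
            { 0,  1, 22}
        };
        int i, j, trace = 0, total = 0;

        for (i = 0; i < NCLASSES; i++)
            for (j = 0; j < NCLASSES; j++)
            {
                total += cm[i][j];
                if (i == j) trace += cm[i][j];  /* diagonal = correct */
            }
        printf ("Success rate = %5.1f%%\n", 100.0*trace/total);
        return 0;
    }

Run on the matrix above, this prints a success rate of 94.7%.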
The nearest neighbor method is commonly used as a classifier
because it is simple and gives pretty good results. However, if one neighbor
gives good results, why not use many neighbors? This simple thought leads
to the k-nearest neighbor method, in which the class is determined by a vote
among the nearest k neighbors in feature space. This is a little more work,
and can lead to ties in some cases. There are two main ways to implement
this: compute all of the distances, sort them into ascending order, and
read off the smallest k of them; or keep only the smallest k in a table and
test/insert after every distance calculation. The example program provided
(nkn.c) uses the first method, as sketched below. This makes it easy to
change k without much modification of the program, so that the effect of
changes to k can be explored.
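
For concreteness, here is a minimal sketch of the first method in C. It is
not the code of nkn.c; the names (NSAMPLES, NFEATURES, knn, and so on) and
the data layout are assumptions made for illustration only:

    #include <stdlib.h>

    #define NSAMPLES  75      /* size of the reference (training) set */
    #define NFEATURES  4      /* measurements per sample              */
    #define NCLASSES   3      /* number of classes                    */

    struct neighbor { double dist; int class; };

    /* Comparison function for qsort: ascending order of distance. */
    static int cmp (const void *a, const void *b)
    {
        double d = ((const struct neighbor *)a)->dist
                 - ((const struct neighbor *)b)->dist;
        return (d > 0) - (d < 0);
    }

    /* Classify the feature vector x by a vote among its k nearest */
    /* neighbors in the training set.  Squared Euclidean distance  */
    /* is used; it gives the same ordering as the true distance.   */
    int knn (double train[NSAMPLES][NFEATURES], int class[NSAMPLES],
             double x[NFEATURES], int k)
    {
        struct neighbor nb[NSAMPLES];
        int votes[NCLASSES] = {0};
        int i, j, best = 0;

        for (i = 0; i < NSAMPLES; i++)           /* all distances */
        {
            double d = 0.0;
            for (j = 0; j < NFEATURES; j++)
                d += (x[j] - train[i][j]) * (x[j] - train[i][j]);
            nb[i].dist  = d;
            nb[i].class = class[i];
        }
        qsort (nb, NSAMPLES, sizeof nb[0], cmp); /* sort ascending  */
        for (i = 0; i < k; i++)                  /* vote among the  */
            votes[nb[i].class]++;                /* k nearest       */
        for (i = 1; i < NCLASSES; i++)
            if (votes[i] > votes[best]) best = i;
        return best;
    }

Note that a tie between classes is resolved here in favor of the lower class
index; as mentioned above, ties are possible for some values of k.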
The k-nearest neighbor algorithm should yield the same results as nearest
neighbor for k = 1; this is a test of correctness. The results for the Iris
data are as follows:

 K    SUCCESS         K    SUCCESS
 1      95%          12      93%
 2      92%          13      95%
 3      93%          14      93%
 4      95%          15      93%
 5      92%          16      95%
 6      92%          17      96%

                           Continued