Page 328 - Introduction to Statistical Pattern Recognition



\varepsilon^* \;\le\; \varepsilon_{NN} \;\le\; \varepsilon^* \left( 2 - \frac{L}{L-1}\,\varepsilon^* \right)                    (7.25)


Equation (7.25) indicates that the NN error is still less than twice the Bayes error, but the upper bound becomes larger as L increases.
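As a quick numeric check of this behavior, the sketch below (a hypothetical illustration, assuming the standard L-class form of the bound, eps*(2 - L eps*/(L-1))) evaluates the upper bound for a fixed Bayes error and several values of L:

```python
# Numeric check of the L-class NN bound (assumed form:
# eps_NN <= eps* * (2 - L/(L-1) * eps*)): for a fixed Bayes error,
# the upper bound grows with L but stays below twice the Bayes error.

def nn_upper_bound(bayes_error, num_classes):
    """Upper bound on the asymptotic NN error for num_classes classes."""
    L = num_classes
    return bayes_error * (2.0 - L / (L - 1.0) * bayes_error)

bayes = 0.10
for L in (2, 3, 10, 100):
    print(L, round(nn_upper_bound(bayes, L), 4))
```

For eps* = 0.1 the bound rises from 0.18 at L = 2 toward 0.19 as L grows, always remaining below 2 eps* = 0.2.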


                       Estimation of kNN Errors

     When samples are generated by a computer as in controlled experiments, it is generally desirable to have independent design and test sets. First, design samples are generated and stored. Then, test samples are independently generated, their kNN's are found from the design set, and the number of misclassified test samples is counted. This is the holdout (H) method for error estimation.
     However, with only one set of samples available in practice, we face the problem of deciding whether the sample set should be divided into two groups or used as one group for both design and test. In the parametric case, the latter approach, the R method, produces an optimistic bias, as was seen in Chapter 5. However, in the voting kNN procedure, we may get different results.

     Table 7-1(a) shows how the 3NN error can be estimated from a single sample set without dividing it into separate design and test sets. The data column, which lists the samples and their true classes, is given. The 3NN's of each sample are found and listed with their classes in the 1st-NN, 2nd-NN, and 3rd-NN columns. Classification is the result of majority voting among the classes of the 3NN's. Then, the classification column is compared with the true class of the data column. If the classification result matches the true class, the sample is labeled as correct. Otherwise, it is considered an error. The estimate of the 3NN error is obtained by counting the number of errors and dividing this by the total number of samples.
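The counting procedure described above can be sketched as follows. This is a minimal illustration with made-up one-dimensional data (not the samples of Table 7-1): each sample is classified by a majority vote of its 3 nearest neighbors, with the sample itself left out of the design set.

```python
# Leave-one-out voting 3NN error estimate: for each sample, find its
# 3 nearest neighbors among the *other* samples, take a majority vote
# of their classes, and count the mismatches with the true class.
from collections import Counter

def loo_3nn_error(samples, labels, k=3):
    n = len(samples)
    errors = 0
    for i in range(n):
        # Distances to every other sample (the test sample is left out).
        dists = sorted(
            (abs(samples[i] - samples[j]), labels[j])
            for j in range(n) if j != i
        )
        votes = Counter(lab for _, lab in dists[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted != labels[i]:
            errors += 1
    return errors / n

# Toy data: two well-separated one-dimensional classes.
x = [0.0, 0.2, 0.4, 2.0, 2.1, 2.3, 2.5]
y = [1, 1, 1, 2, 2, 2, 2]
print(loo_3nn_error(x, y))  # -> 0.0
```

Dropping the `j != i` condition turns this into the resubstitution version, in which each sample becomes its own 1st NN.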
     Let us examine the first row. When X1 is tested, X1 is not included in the design set from which the 3NN's of X1 are selected. Therefore, this operation utilizes the leave-one-out method. On the other hand, in the resubstitution method, X1 must be included in the design set. Since X1 is the closest neighbor of X1 itself, the 1st-NN column of Table 7-1(b) is identical to the data column, and the 1st- and 2nd-NN columns of Table 7-1(a) are shifted to the 2nd- and 3rd-NN columns in Table 7-1(b). Now, applying the voting 3NN