Page 328 - Introduction to Statistical Pattern Recognition



\varepsilon^* \;\le\; \varepsilon_{NN} \;\le\; \varepsilon^* \left( 2 - \frac{L}{L-1}\,\varepsilon^* \right)                    (7.25)


Equation (7.25) indicates that the NN error is still less than twice the Bayes error, but the upper bound becomes larger as L increases.
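As a quick numeric check of this behavior, the sketch below (a hypothetical illustration, assuming the standard L-class form of the bound, eps*(2 - L eps*/(L-1))) evaluates the upper bound for a fixed Bayes error and several values of L:

```python
# Numeric check of the L-class NN bound (assumed form:
# eps_NN <= eps* * (2 - L/(L-1) * eps*)): for a fixed Bayes error,
# the upper bound grows with L but stays below twice the Bayes error.

def nn_upper_bound(bayes_error, num_classes):
    """Upper bound on the asymptotic NN error for num_classes classes."""
    L = num_classes
    return bayes_error * (2.0 - L / (L - 1.0) * bayes_error)

bayes = 0.10
for L in (2, 3, 10, 100):
    print(L, round(nn_upper_bound(bayes, L), 4))
```

For eps* = 0.1 the bound rises from 0.18 at L = 2 toward 0.19 as L grows, always remaining below 2 eps* = 0.2.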


                       Estimation of kNN Errors

     When samples are generated by a computer as in controlled experiments, it is generally desirable to have independent design and test sets. First, design samples are generated and stored. Then, test samples are independently generated, their kNN's are found from the design set, and the number of misclassified test samples is counted. This is the holdout (H) method for error estimation.
     However, with only one set of samples available in practice, we face the problem of deciding whether the sample set should be divided into two groups or used as one group for both design and test. In the parametric case, the latter approach, the R method, produces an optimistic bias, as was seen in Chapter 5. However, in the voting kNN procedure, we may get different results.

     Table 7-1(a) shows how the 3NN error can be estimated from a single sample set without dividing it into separate design and test sets. The data column, which lists the samples and their true classes, is given. The 3NN's of each sample are found and listed with their classes in the 1st-NN, 2nd-NN, and 3rd-NN columns. Classification is the result of majority voting among the classes of the 3NN's. Then, the classification column is compared with the true class of the data column. If the classification result matches the true class, the sample is labeled as correct. Otherwise, it is considered an error. The estimate of the 3NN error is obtained by counting the number of errors and dividing this by the total number of samples.
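The counting procedure described above can be sketched as follows. This is a minimal illustration with made-up one-dimensional data (not the samples of Table 7-1): each sample is classified by a majority vote of its 3 nearest neighbors, with the sample itself left out of the design set.

```python
# Leave-one-out voting 3NN error estimate: for each sample, find its
# 3 nearest neighbors among the *other* samples, take a majority vote
# of their classes, and count the mismatches with the true class.
from collections import Counter

def loo_3nn_error(samples, labels, k=3):
    n = len(samples)
    errors = 0
    for i in range(n):
        # Distances to every other sample (the test sample is left out).
        dists = sorted(
            (abs(samples[i] - samples[j]), labels[j])
            for j in range(n) if j != i
        )
        votes = Counter(lab for _, lab in dists[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted != labels[i]:
            errors += 1
    return errors / n

# Toy data: two well-separated one-dimensional classes.
x = [0.0, 0.2, 0.4, 2.0, 2.1, 2.3, 2.5]
y = [1, 1, 1, 2, 2, 2, 2]
print(loo_3nn_error(x, y))  # -> 0.0
```

Dropping the `j != i` condition turns this into the resubstitution version, in which each sample becomes its own 1st NN.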
     Let us examine the first row. When X1 is tested, X1 is not included in the design set from which the 3NN's of X1 are selected. Therefore, this operation utilizes the leave-one-out method. On the other hand, in the resubstitution method, X1 must be included in the design set. Since X1 is the closest neighbor of X1 itself, the 1st-NN column of Table 7-1(b) is identical to the data column, and the 1st- and 2nd-NN columns of Table 7-1(a) are shifted to the 2nd- and 3rd-NN columns in Table 7-1(b). Now, applying the voting 3NN