Page 276 - Applied Statistics Using SPSS, STATISTICA, MATLAB and R

6.6 Classifier Evaluation


           Resubstitution method
           The whole set S is used both for designing and for testing the classifier. As a
           consequence of the non-independence of the design and test sets, the method
           yields, on average, an optimistic estimate of the error, E[P̂e_d(n)], mentioned in
           section 6.3.3. For the two-class linear discriminant with normal distributions an
           example of such an estimate for various values of n is plotted in Figure 6.15
           (lower curve).
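The optimistic bias of resubstitution can be illustrated with a minimal sketch. A nearest-class-mean rule on synthetic two-class Gaussian data stands in for the linear discriminant (it coincides with it under equal spherical covariances); all identifiers and the data-generating parameters below are illustrative, not taken from the book.

```python
# Resubstitution sketch: the whole set S is used for design AND test.
import numpy as np

rng = np.random.default_rng(0)
n = 100  # samples per class (illustrative)
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)),
               rng.normal(2.0, 1.0, (n, 2))])
y = np.repeat([0, 1], n)

def design(X, y):
    """'Design' the classifier: compute the two class means."""
    return np.array([X[y == c].mean(axis=0) for c in (0, 1)])

def classify(means, X):
    """Assign each sample to the nearest class mean."""
    d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    return d.argmin(axis=1)

means = design(X, y)                             # whole set used for design...
resub_error = np.mean(classify(means, X) != y)   # ...and for testing
print(f"resubstitution error estimate: {resub_error:.3f}")
```

Because the same samples that shaped the class means are also scored, the printed estimate tends to fall below the true error of the designed classifier.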

           Holdout method
           The available n samples of S are randomly divided into two disjoint sets
           (traditionally with 50% of the samples each), S d and S t, used for design and test,
           respectively. The error estimate is obtained from the test set, and therefore suffers
           from the bias and variance effects previously described. By taking the average over
           many partitions of the same size, a reliable estimate of the test set error,
           E[P̂e_t(n)], is obtained (see section 6.3.3). For the two-class linear discriminant
           with normal distributions an example of such an estimate for various values of n is
           plotted in Figure 6.15 (upper curve).
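The holdout procedure above can be sketched as follows, again with a nearest-class-mean rule standing in for the linear discriminant and illustrative data; the averaging over many random 50/50 partitions stabilises the test-set estimate.

```python
# Holdout sketch: random 50/50 split into S_d (design) and S_t (test),
# averaged over many partitions.
import numpy as np

rng = np.random.default_rng(1)
n = 100  # samples per class (illustrative)
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)),
               rng.normal(2.0, 1.0, (n, 2))])
y = np.repeat([0, 1], n)

def holdout_error(X, y, rng):
    idx = rng.permutation(len(y))
    half = len(y) // 2
    d, t = idx[:half], idx[half:]        # S_d and S_t
    means = np.array([X[d][y[d] == c].mean(axis=0) for c in (0, 1)])
    dist = np.linalg.norm(X[t][:, None] - means[None], axis=2)
    return np.mean(dist.argmin(axis=1) != y[t])

estimates = [holdout_error(X, y, rng) for _ in range(200)]
print(f"mean holdout error over 200 partitions: {np.mean(estimates):.3f}")
```

A single split gives a noisy estimate; the average over the 200 partitions is far more reliable, at the cost of designing the classifier 200 times.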

           Partition methods
           Partition methods, also called cross-validation methods, divide the available set S
           into a certain number of subsets, which rotate in their use for design and testing, as
           follows:

              1.  Divide S into k > 1 subsets of randomly chosen cases, with each subset
                 having approximately n/k cases.
              2.  Design the classifier using the cases of the k – 1 remaining subsets and test
                 it on the one left out. A test set estimate Pe_ti is thereby obtained.
              3.  Repeat the previous step, rotating the position of the test subset, thereby
                 obtaining k estimates Pe_ti.
              4.  Compute the average test set estimate Pe_t = (1/k) Σ_{i=1}^{k} Pe_ti and the
                 variance of the Pe_ti.
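The four steps above can be sketched directly, assuming k divides the data evenly and reusing a nearest-class-mean rule as an illustrative design procedure (all names below are assumptions for the sketch, not the book's code).

```python
# k-fold cross-validation sketch following steps 1-4.
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 5  # samples per class and number of folds (illustrative)
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)),
               rng.normal(2.0, 1.0, (n, 2))])
y = np.repeat([0, 1], n)

idx = rng.permutation(len(y))            # step 1: k random subsets
folds = np.array_split(idx, k)

pe = []
for i in range(k):                       # steps 2-3: rotate the test subset
    t = folds[i]
    d = np.concatenate([folds[j] for j in range(k) if j != i])
    means = np.array([X[d][y[d] == c].mean(axis=0) for c in (0, 1)])
    dist = np.linalg.norm(X[t][:, None] - means[None], axis=2)
    pe.append(np.mean(dist.argmin(axis=1) != y[t]))

pe_t = np.mean(pe)                       # step 4: average and variance
print(f"Pe_t = {pe_t:.3f}, var = {np.var(pe, ddof=1):.5f}")
```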

              This is the so-called k-fold cross-validation. For k = 2, the method is similar to
           the traditional holdout method. For k = n, the method is called the leave-one-out
           method, with the classifier designed with n – 1 samples and tested on the one
           remaining sample. Since only one sample is used for testing, the variance of the
           error estimate is large. However, the samples are used for design in the best
           possible way. Therefore, the average test set error estimate will be a good
           estimate of the classifier error for sufficiently high n, since the bias contributed
           by the finiteness of the design set will be low. For other values of k, there is a
           compromise between the high-bias, low-variance behaviour of the holdout method
           and the low-bias, high-variance behaviour of the leave-one-out method, achieved
           with less computational effort.
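The leave-one-out case (k = n) can be sketched as below, again with an illustrative nearest-class-mean rule: each sample is held out exactly once, so the classifier is designed n times on n – 1 samples.

```python
# Leave-one-out sketch: k = n, one held-out sample per design round.
import numpy as np

rng = np.random.default_rng(3)
n = 50  # samples per class (illustrative)
X = np.vstack([rng.normal(0.0, 1.0, (n, 2)),
               rng.normal(2.0, 1.0, (n, 2))])
y = np.repeat([0, 1], n)

errors = []
for i in range(len(y)):
    mask = np.ones(len(y), dtype=bool)
    mask[i] = False                      # design on the n - 1 other samples
    means = np.array([X[mask][y[mask] == c].mean(axis=0) for c in (0, 1)])
    pred = np.linalg.norm(X[i] - means, axis=1).argmin()
    errors.append(pred != y[i])          # test on the single held-out sample

print(f"leave-one-out error estimate: {np.mean(errors):.3f}")
```

Each individual round yields a 0/1 outcome, which is why the per-round variance is large; the final average over all n rounds is the nearly unbiased estimate discussed above.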