
128    4 Statistical Classification

normal distributions, an example of such an estimate for various values of n is plotted in Figure 4.26 (lower curve).

Holdout method

The available n samples of X are randomly divided into two disjoint sets (traditionally with 50% of the samples each), Xd and Xt, used for design and test, respectively. The error estimate is obtained from the test set, and therefore suffers from the bias and variance effects previously described. By taking the average over many partitions of the same size, a reliable estimate of the design set error with the bias of (4-46a) can be obtained, which is an upper bound of the Bayes error and corresponds to the estimate E[Ped(n)] mentioned in section 4.2.4. For the two-class linear discriminant with normal distributions, an example of such an estimate for various values of n is plotted in Figure 4.26 (upper curve).
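As an illustration, the averaged holdout estimate can be sketched as follows. The data and the classifier here are hypothetical stand-ins (two synthetic normal classes and a minimum-distance, i.e. nearest-class-mean, rule), chosen only to show the partition-and-average mechanics:

```python
# Sketch of the holdout estimate averaged over many random 50/50
# partitions. Data and classifier are illustrative assumptions, not
# the book's implementation.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n_per_class):
    """Two synthetic normal classes with different means."""
    x0 = rng.normal(0.0, 1.0, size=(n_per_class, 2))
    x1 = rng.normal(1.5, 1.0, size=(n_per_class, 2))
    X = np.vstack([x0, x1])
    y = np.repeat([0, 1], n_per_class)
    return X, y

def nearest_mean_error(Xd, yd, Xt, yt):
    """Design a minimum-distance classifier on (Xd, yd) and
    return its error rate on the test set (Xt, yt)."""
    m0 = Xd[yd == 0].mean(axis=0)
    m1 = Xd[yd == 1].mean(axis=0)
    pred = (np.linalg.norm(Xt - m1, axis=1)
            < np.linalg.norm(Xt - m0, axis=1)).astype(int)
    return np.mean(pred != yt)

def holdout_estimate(X, y, n_partitions=100):
    """Average the test-set error over many random 50/50 splits."""
    n = len(y)
    errors = []
    for _ in range(n_partitions):
        idx = rng.permutation(n)
        design, test = idx[: n // 2], idx[n // 2:]
        errors.append(nearest_mean_error(X[design], y[design],
                                         X[test], y[test]))
    return np.mean(errors)

X, y = make_data(100)
print(f"holdout error estimate: {holdout_estimate(X, y):.3f}")
```

Averaging over partitions reduces the variance of the single-split estimate, but each classifier is still designed with only n/2 samples, so the design-set bias remains.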

Partition methods

Partition methods divide the available set X into a certain number of subsets, which rotate through the roles of design and test, as follows:

1. Divide X into k > 1 subsets of randomly chosen patterns, with each subset having n/k patterns.
2. Design the classifier using the patterns of k-1 subsets and test it on the remaining one. A test set estimate Pet,i is thereby obtained.
3. Repeat the previous step, rotating the position of the test set, thereby obtaining k estimates Pet,i.
4. Compute the average test set estimate Pet = (1/k) Σi=1..k Pet,i and the variance of the Pet,i.

For k=2, the method is similar to the traditional holdout method. For k=n, the method is called the leave-one-out method, with the classifier designed with n-1 samples and tested on the one remaining sample. Since only one sample is used for testing, the variance of the error estimate is large. However, the samples are used for design in the best possible way; therefore, for sufficiently high n, the average test set error estimate will be a good estimate of the classifier error, since the bias contributed by the finiteness of the design set will be low. For other values of k there is a compromise between the high bias-low variance of the holdout method and the low bias-high variance of the leave-one-out method, with less computational effort.
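Steps 1-4 above, including the leave-one-out case k = n, can be sketched as follows. Again the data and the minimum-distance classifier are hypothetical, chosen only to make the rotation of the test subset concrete:

```python
# Sketch of the partition (k-fold) method of steps 1-4.
# Data and classifier are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def nearest_mean_error(Xd, yd, Xt, yt):
    # minimum-distance classifier: assign to the nearest class mean
    m0 = Xd[yd == 0].mean(axis=0)
    m1 = Xd[yd == 1].mean(axis=0)
    pred = (np.linalg.norm(Xt - m1, axis=1)
            < np.linalg.norm(Xt - m0, axis=1)).astype(int)
    return np.mean(pred != yt)

def partition_estimate(X, y, k):
    """Split X into k subsets, rotate the test subset, and return the
    average test set estimate Pet and the variance of the Pet,i."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)                    # step 1
    pe = []
    for i, test in enumerate(folds):                  # steps 2-3
        design = np.concatenate(folds[:i] + folds[i + 1:])
        pe.append(nearest_mean_error(X[design], y[design],
                                     X[test], y[test]))
    return np.mean(pe), np.var(pe)                    # step 4

# synthetic two-class normal data (illustrative only)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(1.5, 1.0, (50, 2))])
y = np.repeat([0, 1], 50)

pe_2, _ = partition_estimate(X, y, k=2)          # similar to holdout
pe_loo, _ = partition_estimate(X, y, k=len(y))   # leave-one-out
```

Note that k = 2 designs each classifier with only n/2 samples (high bias), while k = n designs with n-1 samples (low bias) at the cost of n design runs and a high-variance single-sample test.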

Bootstrap method

Bootstrap methods are based on the generation of artificial samples (bootstrap samples) by randomly drawing, with replacement, from the existing samples with uniform distribution within each class. The error estimate is computed on the original set with the classifier designed using large sets of bootstrap samples. It can be shown that