Figure 4.25a shows the classification matrix for the two-class cork stoppers problem using the whole ten-feature set and equal prevalences. The performance does not increase significantly compared with the two-feature solution presented previously, and is worse than that of the solution using the four-feature vector [ART PRM NG RAAR]', shown in Figure 4.25b.
There are, however, further compelling reasons for not using a large number of features. When using estimates of means and covariances derived from a training set, we are designing a biased classifier, one fitted to the training set (review section 2.6). We should therefore expect our training set error estimates to be, on average, optimistic. On the other hand, error estimates obtained on independent test sets are expected to be, on average, pessimistic.
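The following is a minimal simulation sketch of this bias; it does not use the book's cork-stoppers data, and the dimension, class means and sample sizes are illustrative assumptions. A linear discriminant is designed from a small training set, and its resubstitution (training set) error is compared with the error measured on a large independent test set, averaged over many runs.

```python
# A minimal sketch, not the book's method or data: synthetic Gaussian classes,
# a linear discriminant designed from estimated means and a pooled covariance,
# and a comparison of resubstitution (training set) error with the error on a
# large independent test set. Dimension, means and sample sizes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, runs = 10, 30, 5000, 200   # deliberately small n/d ratio
mu0, mu1 = np.zeros(d), np.full(d, 0.5)        # true class means

def sample(n, mu):
    """Draw n patterns from a unit-covariance Gaussian centred at mu."""
    return rng.normal(mu, 1.0, size=(n, d))

def design_and_test():
    # Design the classifier from a small training set.
    X0, X1 = sample(n_train, mu0), sample(n_train, mu1)
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2
    w = np.linalg.solve(S, m1 - m0)            # Fisher linear discriminant
    b = w @ (m0 + m1) / 2                      # threshold at the midpoint

    def error(A0, A1):
        # Fraction misclassified: class 0 above threshold, class 1 below.
        return ((A0 @ w > b).mean() + (A1 @ w <= b).mean()) / 2

    T0, T1 = sample(n_test, mu0), sample(n_test, mu1)
    return error(X0, X1), error(T0, T1)

results = np.array([design_and_test() for _ in range(runs)])
print(f"average training set (resubstitution) error: {results[:, 0].mean():.3f}")
print(f"average independent test set error:          {results[:, 1].mean():.3f}")
```

With such a small n/d ratio (here 30 patterns per class for 10 features), the average training set error comes out well below the average test set error, illustrating the optimistic bias of resubstitution estimates.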
The bibliography at the end of the present chapter includes references explaining the mathematical details of this issue. We present here some important results as an aid for the designer to choose sensible values for the dimensionality ratio, n/d. Later, when we discuss the topic of classifier evaluation, we will come back to this issue from another perspective.
Let us denote:
Pe - probability of error of an optimum Bayesian classifier;
Ped(n) - training (design) set estimate of Pe for n patterns;
Pet(n) - test set estimate of Pe for n patterns.
Thus, the quantity Ped(n) represents an estimate of Pe influenced only by the finite size of the design set, i.e., the classifier error is measured exactly and its deviation from Pe is due solely to the finiteness of the design set; the quantity Pet(n) represents an estimate of Pe influenced only by the finite size of the test set, i.e., the error of the Bayesian classifier is estimated by counting how many of n patterns are misclassified. These quantities satisfy Ped(∞) = Pe and Pet(∞) = Pe, i.e., they converge to the theoretical value Pe with increasing values of n.
In normal practice, these error probabilities are not known exactly. Instead, we compute estimates of these probabilities, Ped and Pet, as percentages of misclassified patterns, in exactly the same way as we have done in the classification matrices presented so far. The probability of obtaining k misclassified patterns out of n, for a classifier with a theoretical error Pe, is given by the binomial law:

$$P(k) = \binom{n}{k}\, Pe^{k} (1 - Pe)^{n-k}.$$
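As a simple illustration of this law, the short sketch below computes P(k) for a few values of k; the values of n, k and Pe are assumptions chosen for the example only.

```python
# Sketch of the binomial law above: probability of observing k misclassified
# patterns out of n when the classifier's true error is Pe.
from math import comb

def prob_k_errors(k, n, Pe):
    """Binomial law: P(k) = C(n, k) * Pe**k * (1 - Pe)**(n - k)."""
    return comb(n, k) * Pe ** k * (1 - Pe) ** (n - k)

# Hypothetical classifier with true error Pe = 0.1 tested on n = 100 patterns.
for k in (5, 10, 15):
    print(f"P(k = {k:2d} errors out of 100) = {prob_k_errors(k, 100, 0.1):.4f}")
```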
The maximum likelihood estimate of Pe under this binomial law is precisely:

$$\hat{Pe} = \frac{k}{n}.$$
The standard deviation of this estimate is simply:

$$\sigma(\hat{Pe}) = \sqrt{\frac{Pe(1 - Pe)}{n}}.$$
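Putting these two formulas together, the sketch below computes the estimate and its standard deviation for a hypothetical test set; the values of n and k are assumptions for illustration.

```python
# Sketch of the two estimates above: the maximum likelihood estimate of Pe
# and its standard deviation. n and k are illustrative assumptions.
from math import sqrt

n, k = 150, 15                           # hypothetical: 15 errors in 150 test patterns
Pe_hat = k / n                           # maximum likelihood estimate of Pe
sigma = sqrt(Pe_hat * (1 - Pe_hat) / n)  # std. deviation, substituting Pe_hat for Pe

print(f"Pe_hat = {Pe_hat:.3f}, sigma = {sigma:.4f}")
# Pe_hat = 0.100, sigma = 0.0245, i.e., the error estimate is 0.10 with a
# one-standard-deviation uncertainty of about 0.02.
```

Note that the standard deviation formula involves the unknown Pe; in practice, as in this sketch, one substitutes the estimate for it.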