Page 265 - Applied Statistics Using SPSS, STATISTICA, MATLAB and R
P. 265
246 6 Statistical Classification
Both standard deviations, which can be inspected in text boxes for a selected
value of n/d, are initially high for low values of n and converge slowly to zero with
n → ∞ . For the situation shown in Figure 6.15, the standard deviation of eP ˆ d () n
changes from 0.089 for n = d (14 cases, 7 per class) to 0.033 for n = 10d (140
cases, 70 per class).
Based on the behaviour of the Ε[ eP ˆ d ( ) n ] and Ε[ eP ˆ t ( ) n ] curves, some criteria
can be established for the dimensionality ratio. As a general rule of thumb, using
dimensionality ratios well above 3 is recommended.
If the cases are not equally distributed by the classes, it is advisable to use the
smaller number of cases per class as value of n. Notice also that a multi-class
problem can be seen as a generalisation of a two-class problem if every class is
well separated from all the others. Then, the total number of needed training
samples for a given deviation of the expected error estimates from the Bayes error
*
*
can be estimated as cn , where n is the particular value of n that achieves such a
deviation in the most unfavourable, two-class dichotomy of the multi-class
problem.
6.4 The ROC Curve
The classifiers presented in the previous sections assumed a certain model of the
feature vector distributions in the feature space. Other model-free techniques to
design classifiers do not make assumptions about the underlying data distributions.
They are called non-parametric methods. One of these methods is based on the
choice of appropriate feature thresholds by means of the ROC curve method (where
ROC stands for Receiver Operating Characteristic).
The ROC curve method (available with SPSS; see Commands 6.2) appeared in
the fifties as a means of selecting the best voltage threshold discriminating pure
noise from signal plus noise, in signal detection applications such as radar. Since
the seventies, the concept has been used in the areas of medicine and psychology,
namely for test assessment purposes.
The ROC curve is an interesting analysis tool for two-class problems, especially
in situations where one wants to detect rarely occurring events such as a special
signal, a disease, etc., based on the choice of feature thresholds. Let us call the
absence of the event the normal situation (N) and the occurrence of the rare event
the abnormal situation (A). Figure 6.16 shows the classification matrix for this
situation, based on a given decision rule, with true classes along the rows and
5
decided (predicted) classifications along the columns .
5
The reader may notice the similarity of the canonical two-class classification matrix with the
hypothesis decision matrix in chapter 4 (Figure 4.2).