

A: The ROC curves for ALTV and ASTV are shown in Figure 6.20. The areas under the ROC curve, computed by SPSS with 95% confidence intervals, are 0.709 ± 0.11 and 0.781 ± 0.10 for ALTV and ASTV, respectively. We therefore select the ASTV parameter as the best diagnostic feature.
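
For readers working in R rather than SPSS, the same areas and confidence intervals can be sketched with the pROC package. This is only an illustration under assumed names: ctg is a hypothetical data frame holding the CTG variables ALTV and ASTV together with a binary outcome y, none of which are defined in this form in the text.

   library(pROC)                   # provides roc() and ci.auc()
   r.altv <- roc(ctg$y, ctg$ALTV)  # ROC curve for ALTV (hypothetical data)
   r.astv <- roc(ctg$y, ctg$ASTV)  # ROC curve for ASTV
   ci.auc(r.altv)                  # area under the curve with 95% CI
   ci.auc(r.astv)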


6.5 Feature Selection

As already discussed in section 6.3.3, great care must be exercised in reducing the number of features used by a classifier, in order to maintain a high dimensionality ratio and, therefore, reproducible performance, with error estimates sufficiently near the theoretical value. For this purpose, one may use the hypothesis test methods described in chapters 4 and 5 with the aim of discarding clearly non-useful features at an initial stage of the classifier design. This feature assessment task, while assuring that an information-carrying feature set is indeed used in the classifier, does not guarantee that the whole set is needed. Consider, for instance, that we are presented with a classification problem described by 4 features, x1, x2, x3 and x4, with x1 and x2 perfectly discriminating the classes, and x3 and x4 being linearly dependent on x1 and x2. The hypothesis tests will then find that all features contribute to class discrimination. However, this discrimination could be performed equally well using the alternative sets {x1, x2} or {x3, x4}. Briefly, discarding features with no aptitude for class discrimination is no guarantee against redundant features.
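
This redundancy effect can be made concrete with a small R simulation. The sketch below is not from the text: the data are synthetic, and lda from the MASS package stands in for an arbitrary classifier. Because {x3, x4} is a nonsingular linear transformation of {x1, x2}, and linear discriminant analysis is invariant under such transformations, both subsets yield exactly the same training error.

   library(MASS)                        # provides lda()
   set.seed(0)
   n  <- 100
   cl <- factor(rep(c("A", "B"), each = n))
   x1 <- c(rnorm(n, 0), rnorm(n, 3))    # discriminates the classes
   x2 <- c(rnorm(n, 0), rnorm(n, 3))    # discriminates the classes
   x3 <- x1 + x2                        # linearly dependent on x1, x2
   x4 <- x1 - x2                        # linearly dependent on x1, x2
   d  <- data.frame(cl, x1, x2, x3, x4)
   err <- function(f) {                 # training error of an LDA fit
     mean(predict(lda(f, data = d), d)$class != d$cl)
   }
   err(cl ~ x1 + x2)
   err(cl ~ x3 + x4)                    # same value: {x3, x4} is redundant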
   There is abundant literature on the topic of feature selection (see References). Feature selection uses a search procedure to find a feature subset (model) obeying a stipulated merit criterion. A possible choice for this criterion is minimising Pe, with the disadvantage that the search process then depends on the classifier type. More often, a class separability criterion such as the Bhattacharyya distance or the ANOVA F statistic is used. The Wilks' lambda, defined as the ratio of the determinant of the pooled covariance over the determinant of the total covariance, is also a popular criterion. Physically, it can be interpreted as the ratio between the average class volume and the total volume of all cases. Its value ranges from 0 (complete class separation) to 1 (complete class fusion).
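
As a concrete illustration (not from the text), Wilks' lambda, i.e. |W|/|T| with W the pooled within-class SSP matrix and T the total SSP matrix, can be computed in R either directly or via the standard manova function; the built-in iris data set is used here merely as a stand-in example.

   X <- as.matrix(iris[, 1:4])
   g <- iris$Species
   tot <- crossprod(scale(X, scale = FALSE))    # total SSP matrix
   wit <- Reduce(`+`, lapply(split(as.data.frame(X), g),
            function(s) crossprod(scale(as.matrix(s), scale = FALSE))))
   det(wit) / det(tot)                          # Wilks' lambda, close to 0
                                                # (well-separated classes)
   m <- manova(cbind(Sepal.Length, Sepal.Width,
                     Petal.Length, Petal.Width) ~ Species, data = iris)
   summary(m, test = "Wilks")                   # same statistic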
   As for the search method, the following are popular ones and available in STATISTICA and SPSS:

1. Sequential search (direct)
The direct sequential search corresponds to performing successive feature additions or eliminations to the target set, based on a separability criterion.
   In a forward search, one starts with the feature of most merit and, at each step, revises all the features not yet included in the subset; the one that contributes the most to class discrimination, evaluated through the merit criterion, is then included in the subset and the procedure advances to the next search step. The process goes on until the merit criterion for any candidate feature is below a specified threshold (a sketch of this greedy loop follows).
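
The following R sketch of such a forward search is illustrative only: the wilks helper (a function form of the computation shown earlier), the greedy loop and the min.gain stopping threshold are assumptions of mine; only the criterion itself is taken from the text. At each step the candidate that most lowers Wilks' lambda is added, and the search stops when no candidate improves the criterion by at least the threshold.

   wilks <- function(X, g) {            # Wilks' lambda of a feature subset
     X   <- as.matrix(X)
     tot <- crossprod(scale(X, scale = FALSE))
     wit <- Reduce(`+`, lapply(split(as.data.frame(X), g),
              function(s) crossprod(scale(as.matrix(s), scale = FALSE))))
     det(wit) / det(tot)
   }
   forward.search <- function(data, g, min.gain = 0.01) {
     sel  <- character(0)
     best <- 1                          # lambda = 1: no features selected yet
     repeat {
       cand <- setdiff(names(data), sel)
       if (length(cand) == 0) break
       lam <- sapply(cand, function(f) wilks(data[, c(sel, f)], g))
       if (best - min(lam) < min.gain) break   # merit gain below threshold
       sel  <- c(sel, names(which.min(lam)))   # add the best candidate
       best <- min(lam)
     }
     sel
   }
   forward.search(iris[, 1:4], iris$Species)   # greedily ordered subset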