Effect of dimensionality: The dimensionality of the data appears to play an important role in
determining the relationship between the size of the bias and the sample size. As is shown in
Fig. 7-5, for small values of n (say, n ≤ 4), changing the sample size is an effective means of
reducing the bias. For larger values of n, however, increasing the number of samples becomes a
more and more futile means of improving the estimate. It is in these higher-dimensional cases
that improved techniques for accurately estimating the Bayes error are needed. It should be
pointed out that, in the expression for the bias of the NN error, n represents the local or
intrinsic dimensionality of the data, as discussed in Chapter 6. In many applications, the
intrinsic dimensionality is much smaller than the dimensionality of the observation space.
Therefore, in order to calculate the bias, it is necessary that the intrinsic dimensionality be
estimated from the data using (6.115).
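
To see concretely why enlarging the sample helps little once n is large, the short sketch below
(not from the text) assumes, for illustration, that the sample-size dependence of the bias decays
on the order of N^(-2/n), where n is the intrinsic dimensionality, and tabulates how much a
tenfold increase in N shrinks that factor.

```python
# Illustration only (assumption): the sample-size factor of the NN-error bias
# is taken to decay like N**(-2/n), with n the intrinsic dimensionality.

def sample_size_factor(N, n):
    """Assumed N**(-2/n) dependence of the bias on the sample size N."""
    return N ** (-2.0 / n)

for n in (2, 4, 8, 16):
    # Fraction of the factor that remains after a tenfold increase in N.
    ratio = sample_size_factor(16000, n) / sample_size_factor(1600, n)
    print(f"n = {n:2d}: 10x more samples leaves {100 * ratio:.0f}% of the factor")
```

Under this assumption a tenfold increase in N removes 90% of the factor when n = 2, but only
about 25% of it when n = 16, which is the diminishing return described above.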

Effect of densities: The expectation term of (7.35) gives the effect of the densities on the size
of the bias. In general, it is very hard to determine the effect of this term because of its
complexity. In order to investigate the general trends, however, we can compute the term
numerically for a normal case.

     Experiment 2: Computation of E{·} of (7.35)
          Data: I-I (Normal)
               M adjusted to give ε* = 2, 5, 10, 20, 30 (%)
          Dimensionality: n = 2, 4, 8, 16
          Sample size: N1 = N2 = 1600n
          Metric: A = I (Euclidean)
          Results: Table 7-2 [5]

In the experiment, B of (7.36) was evaluated at each generated sample point, where the
mathematical formulas based on the normality assumption were used to compute p(X) and qi(X).
The expectation of (7.35) was replaced by the sample mean taken over 1600n samples per class.
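
A minimal sketch of this averaging procedure is given below. It is not code from the text: the
Data I-I setup (two unit-covariance normal classes with equal priors, whose mean separation M is
chosen so that the Bayes error, Φ(-M/2), matches the target value) follows the experiment
description above, while eval_B is a hypothetical placeholder for B(X) of (7.36), which is not
reproduced on this page.

```python
import numpy as np
from scipy.stats import norm

def expectation_term(n, bayes_err, eval_B, seed=0):
    """Monte Carlo stand-in for the expectation of (7.35), Data I-I.

    eval_B is a hypothetical placeholder for B(X) of (7.36); it receives the
    generated samples together with p(X) and the posteriors q1(X), q2(X).
    """
    rng = np.random.default_rng(seed)
    # For two unit-covariance normals with equal priors, the Bayes error is
    # Phi(-M/2), so the mean separation M is set from the target error.
    M = -2.0 * norm.ppf(bayes_err)
    m1 = np.zeros(n)
    m2 = np.zeros(n)
    m2[0] = M                        # put the whole separation on the first axis
    N = 1600 * n                     # samples per class, as in Experiment 2

    X = np.vstack([rng.normal(m1, 1.0, size=(N, n)),
                   rng.normal(m2, 1.0, size=(N, n))])

    # Class-conditional densities from the normality assumption (identity covariance).
    p1 = np.exp(-0.5 * ((X - m1) ** 2).sum(axis=1)) / (2 * np.pi) ** (n / 2)
    p2 = np.exp(-0.5 * ((X - m2) ** 2).sum(axis=1)) / (2 * np.pi) ** (n / 2)
    p = 0.5 * p1 + 0.5 * p2          # mixture density p(X), equal priors
    q1 = 0.5 * p1 / p                # posterior q1(X)
    q2 = 0.5 * p2 / p                # posterior q2(X)

    # The expectation of (7.35) is replaced by the sample mean over 1600n
    # samples per class, with B(X) supplied by the caller.
    return float(np.mean(eval_B(X, p, q1, q2)))
```

The form of B(X) is deliberately left abstract here; any vectorized function of X, p(X), and the
posteriors can be averaged this way, so the sketch captures only the sampling and averaging
machinery of the experiment, not the specific integrand of (7.36).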
Table 7-2 reveals many properties of the expectation term. Special attention must be paid,
however, to the fact that, once n becomes large (n > 4), the value of n has little effect on the
size of the expectation. This implies that the factor of (7.37) dominates the effect of n on the
bias. That is, the bias is much larger in high dimensions. This coincides with the observation
that, in practice, the NN error comes down, contrary to theoretical expectation, by selecting a
smaller