Page 122 -
P. 122

4.3 Model-Free Techni~ues   109

                             The Parzen  window  and k-nearest neighbour  methods, to be  explained  in  the
                           following sections, are both based on the idea of estimating the pdf  of the pattern
                           distributions. The ROC curve method is a completely model-free method for two-
                           class  situations,  which  simply  addresses  the  problem  of  choosing  the  best
                           discriminating thresholds as a compromise in conflicting cost requirements.
                             The  methods  incorporating  pdf  estimation  assume  well-behaved,  smooth
                           distributions, such that the probability density p(x) at point x can be estimated by
                           considering a sufficiently small region R with volume V, centred at x, as follows:





                             Imagine that one has a training set of n patterns and that k of these patterns fall
                           in the region R. The second hand integral of (4-31) then represents the number of
                           patterns falling inside R, therefore we obtain the estimate:






                             If  the volume is too small compared to the number of  samples, we will obtain
                           quite erratic estimates with a high variance. If it is too big so that it encloses a large
                           number of samples, we will obtain a smoothing effect since our point estimate will
                           depend on distant samples. These conflicting requirements can be addressed in two
                            ways:
                            1. Fix  the  volume  so that  it  is  inversely  dependent on  the  number  of  available
                              points, i.e., the more points there are, the smaller is the volume. Compute k from
                              the data. This leads to the so-called kernel-based approaches such as the Parzen
                              window method.
                            2. Fix  k depending on  the number of  points and compute  V from the data. This
                              leads to the k-nearest neighbour method.

                              Usually the model-free techniques are only used in low dimensional situations.
                            Estimation of a pdf  is a rather difficult issue in high-dimensional spaces, and ROC
                            analysis in this situation also requires burdensome compulations for a large number
                            of  combinations of thresholds.
                              It is worth noting that pdf estimation in high dimensional spaces suffers from the
                            curse  of  dirnensionalify  already  discussed  in  section  2.6.  Going  back  to  the
                            argument presented in that section, concerning the partitioning of the feature space
                            into  m~ypercubes, it  is  understandable  that  in  order  to  obtain  a  reliable pdf
                            estimate we  need  to  select a sufficiently high  number rn of  intervals  along each
                            dimension, and have at least one point in  each hypercube. Hence, the training set
                            size required for accurate  pdf estimation will grow exponentially with d.
                              Model-free  techniques  come  at  a  cost.  First  of  all,  no  design  formulas  are
                            available; therefore, a trial and error strategy often has to be employed in order to
                             obtain  a  "best"  solution.  Second,  dimensionality  ratio,  performance  and
   117   118   119   120   121   122   123   124   125   126   127