48 2 Pattern Discrimination
Figure 2.25. Two-class scatter plot with dimensionality ratio n/d = 30.
This is a dramatic example of how the use of a small set of patterns relative
to the number of features, i.e., the use of a low dimensionality ratio, n/d, can lead
to totally wrong conclusions about the performance of a classifier (or regressor) evaluated
on a training set. We can get more insight into this dimensionality problem by
looking at it from the perspective of how many patterns one needs to have
available in order to design a classifier, i.e., what the minimum size of the
training set is. Suppose that we were able to train the classifier by deducing a
rule based on the location of each pattern in the d-dimensional space. In a certain
sense, this is in fact how the neural network approach works. In order to have a
sufficient resolution we assume that the range of values for each feature is divided
into m intervals; therefore we have to assess the location of each pattern in each of
the m^d hypercubes. This number of hypercubes grows exponentially with d, so that for a
value of d that is not too low we have to find a mapping for a quite sparsely
occupied space, i.e., with a poor representation of the mapping.
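The growth in the number of hypercubes can be made concrete with a small sketch. The values m = 10 intervals per feature and n = 1000 patterns are illustrative assumptions, not figures from the text, and the function names are hypothetical; the only point is that n patterns can occupy at most n of the m^d cells.

```python
def hypercube_count(m: int, d: int) -> int:
    """Number of cells when each of d features is split into m intervals."""
    return m ** d


def max_occupied_fraction(n: int, m: int, d: int) -> float:
    """Upper bound on the fraction of cells that n patterns can occupy."""
    cells = hypercube_count(m, d)
    return min(n, cells) / cells


if __name__ == "__main__":
    m, n = 10, 1000  # assumed values, chosen only for illustration
    for d in (1, 2, 3, 6, 12):
        cells = hypercube_count(m, d)
        frac = max_occupied_fraction(n, m, d)
        print(f"d={d:2d}  cells={cells:.0e}  occupied fraction <= {frac:.1e}")
```

Under these assumptions the training set can cover every cell up to d = 3, but already at d = 6 at most one cell in a thousand contains a pattern.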
This phenomenon, generally called the curse of dimensionality,
also affects our common intuition about the concept of neighbourhood. In order to
see this, imagine that we have a one-dimensional normal distribution. We know
then that about 68% of the distributed values lie within one standard deviation
around the mean. If we increase our representation to two independent dimensions,
we now have only about 46% within one standard deviation of the mean along both
axes, i.e., in a square around the mean, and for a d-dimensional
representation we have (0.68)^d × 100% of the samples in a hypercube with a half-side of
one standard deviation, which means, as shown in Figure 2.26, that for d = 12 less
than 1% of the data is in the neighbourhood of the mean! (A hypersphere with a
radius of one standard deviation lies inside this hypercube and therefore holds
even less.) For the well-known 95%
neighbourhood, corresponding approximately to two standard deviations, we will
find only about 54% of the patterns for d = 12.
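The percentages quoted here follow from multiplying the per-dimension coverage, which a short sketch can reproduce. It uses the rounded values 0.68 and 0.95 from the text, and the function names are hypothetical. Strictly, (0.68)^d is the probability mass of a hypercube with a half-side of one standard deviation; the second helper, included only for contrast, evaluates the mass of the inscribed hypersphere with the closed-form chi-square distribution function, which in this form is valid for even d only.

```python
import math


def hypercube_mass(p_1d: float, d: int) -> float:
    """Probability that all d independent features fall inside the
    one-dimensional interval that holds a fraction p_1d of the values."""
    return p_1d ** d


def hypersphere_mass(radius: float, d: int) -> float:
    """Probability that a standard normal vector in d dimensions (d even)
    lies within `radius` standard deviations of the mean.

    Uses the chi-square CDF for even degrees of freedom d = 2k:
    P = 1 - exp(-x/2) * sum_{j<k} (x/2)^j / j!, with x = radius^2.
    """
    if d <= 0 or d % 2 != 0:
        raise ValueError("this closed form is only valid for even d > 0")
    half_x = radius ** 2 / 2.0
    partial = sum(half_x ** j / math.factorial(j) for j in range(d // 2))
    return 1.0 - math.exp(-half_x) * partial


if __name__ == "__main__":
    for d in (1, 2, 12):
        print(f"d={d:2d}  68% interval: {hypercube_mass(0.68, d):.4f}"
              f"  95% interval: {hypercube_mass(0.95, d):.4f}")
    print(f"circle of radius 1 (d=2): {hypersphere_mass(1.0, 2):.4f}")
```

With the rounded 0.68 the d = 12 value falls just under 1%; with the more precise one-dimensional figure of 0.6827 it lands marginally above, so the "less than 1%" statement is best read as an order of magnitude.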
The dimensionality ratio is a central issue in PR, with a deep influence on
the quality of any PR project; we will therefore pay special attention to
it at all opportune moments.