5.6.4 Network Complexity
In order to deal appropriately with the bias-variance trade-off described in the
preceding section, it is important to choose a sensible network complexity for the
given dataset. Note that the bias-variance trade-off affords the appropriate insight
into the dimensionality ratio issue: for adequate training of a complex network with
w weights we need to set a lower bound on the number n of training patterns in
order to ensure that the network has generalization capability, yielding
reproducible results in independent test sets. Note also that, except for single-layer
networks, the number of weights will often be much larger than the dimension d of
the pattern vectors.
In statistical classification it is possible to derive formulas describing, for
instance, the training set error behaviour of Bayesian normal classifiers as a
function of the n/d ratio. The problem is more complicated with neural networks,
where no such formulas are available. Instead, it is possible to derive bounds for
the n/w ratio. As the modelling capability of a neural network is a function of the
discriminants implemented by its neurons, such bounds are essentially based on
the combinatorial study of the class labelings achievable with sets of linear
discriminants.
For this purpose, let us first consider the number of linearly separable
dichotomies of n points, where a dichotomy is any partition of the n patterns into
two disjoint groups. We assume that the n patterns are regularly distributed in
ℜ^d, i.e., no subset of d + 1 patterns is contained in a hyperplane of ℜ^d. In two-
dimensional space, this corresponds to saying that no group of 3 points is collinear.
It is possible to prove that, given n regularly distributed points in ℜ^d, the
number of dichotomies that are linearly separable is given by:

$$
D(n, d) =
\begin{cases}
2^n, & n \le d + 1 \\[4pt]
2 \sum_{k=0}^{d} \binom{n-1}{k}, & n > d + 1
\end{cases}
\tag{5-51}
$$

When n = d + 1, both expressions give the same result. The proof of this
result, known as Cover's theorem, can be found in (Cover, 1965).
For d = 2, the application of (5-51) gives the following results:
- For n = 3, all the 2^3 = 8 dichotomies of the three points are linearly separable;
- For n = 4, only 14 of the 2^4 = 16 dichotomies are linearly separable;
- For n = 5, only 22 of the 2^5 = 32 dichotomies are linearly separable.
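As an illustration (not part of the original text), the following minimal Python sketch evaluates (5-51) and reproduces the d = 2 values listed above; the function name cover_dichotomies is ours.

```python
from math import comb

def cover_dichotomies(n: int, d: int) -> int:
    """D(n, d) of eq. (5-51): the number of linearly separable
    dichotomies of n points in general position in d dimensions
    (Cover, 1965)."""
    if n <= d + 1:
        return 2 ** n
    return 2 * sum(comb(n - 1, k) for k in range(d + 1))

# Reproducing the d = 2 results listed above:
for n in (3, 4, 5):
    print(n, cover_dichotomies(n, 2), "of", 2 ** n)
# Output: 3 8 of 8 / 4 14 of 16 / 5 22 of 32
```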
Using polynomial decision functions is equivalent to using linear discriminants
in a higher-dimensional space, as seen in 2.1.1. Thus, using quadratic decision
functions in d = 2 is equivalent to using linear decision functions in d' =
(2+1)(2+2)/2 - 1 = 5 dimensions (the five monomial features x1, x2, x1^2, x2^2,
x1x2). This means that for n = 5 patterns, all dichotomies would be separable by
the quadratic decision functions, since they are linearly separable in the
5-dimensional space (D(5, 5) = 2^5 = 32).
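The dimension count and the D(5, 5) value can be checked with another short sketch. This is again ours rather than the book's; it assumes a degree-2 polynomial feature map, and the names poly_feature_dim and quadratic_map are hypothetical.

```python
from math import comb

def poly_feature_dim(d: int, degree: int = 2) -> int:
    """Number of monomials of degree <= `degree` in d variables,
    excluding the constant term: C(d + degree, degree) - 1."""
    return comb(d + degree, degree) - 1

def quadratic_map(x1: float, x2: float) -> tuple:
    """Explicit quadratic feature map for d = 2 (five components)."""
    return (x1, x2, x1 * x1, x2 * x2, x1 * x2)

print(poly_feature_dim(2))   # 5: dimension of the quadratic feature space
# For n = 5 points in d' = 5 dimensions, n <= d' + 1, so by (5-51)
# D(5, 5) = 2 ** 5 = 32: every dichotomy is linearly separable there.
print(2 ** 5)                # 32
```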