For instance, if we want to ensure with confidence a = 99% that we will reach a
solution among the best p = 20%, we will need to perform r = ln(1 − 0.99)/ln(1 − 0.2)
≈ 21 experiments.
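This number follows from a simple probabilistic argument (sketched here in LaTeX
notation): each independently initialized run lands among the best fraction p of
solutions with probability p, so the probability that all r runs miss is (1 − p)^r.
Requiring confidence a that at least one run succeeds gives

    \[
    1 - (1-p)^r \ge a
    \quad\Longleftrightarrow\quad
    r \ge \frac{\ln(1-a)}{\ln(1-p)},
    \]

and for a = 0.99, p = 0.2 this yields r ≥ 20.6, hence r = 21.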
Besides repeating experiments, there are other techniques that can achieve good
results:
- Case shuffling: by shuffling the cases in each epoch, we hope that the network
tries alternative descent routes, and avoids getting stuck in a path leading to a
local minimum.
- Adding noise: by adding a small amount of random noise to the inputs in each
epoch, we hope that a similar effect to case shuffling is achieved, namely the
exploration of alternative descent routes. Noise can also provide better
generalization of the neural net, preventing it from over-fitting the training set.
- Jogging weights: by adding a small random quantity to the weights, it may be
possible to avoid a local minimum (all three techniques are sketched in the code
below).
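As an illustration, the following minimal NumPy sketch (a hypothetical model
interface, not the book's code) shows where the three tricks fit in a plain
gradient-descent training loop; model is assumed to expose a weights array and a
gradients(x, y) method:

    import numpy as np

    rng = np.random.default_rng(0)

    def train(model, X, y, epochs=1000, lr=0.01,
              noise_std=0.01, jog_std=0.001, jog_every=100):
        n = X.shape[0]
        for epoch in range(epochs):
            # Case shuffling: present the cases in a new random order each epoch.
            order = rng.permutation(n)
            # Adding noise: slightly perturb the inputs in each epoch.
            Xe = X[order] + rng.normal(0.0, noise_std, size=X.shape)
            ye = y[order]
            for xb, yb in zip(Xe, ye):
                model.weights -= lr * model.gradients(xb, yb)
            # Jogging weights: occasionally add a small random quantity to the
            # weights, which may help escape a local minimum.
            if (epoch + 1) % jog_every == 0:
                model.weights += rng.normal(0.0, jog_std,
                                            size=model.weights.shape)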
Dimensionality ratio and generalization
If we train a network with arbitrarily complex architecture we may also obtain
arbitrarily low errors for the training data, since we are attempting to model exactly
the structure of the training data by the neural network. As we have already seen in
previous chapters, the real issue is to obtain a solution that performs equally well
(on average) in independent test sets. Once more we are confronted with a
dimensionality ratio issue, here in the form of n/w, n being the total number of
patterns in the training set and w the total number of weights to adjust. If w is too
small we may obtain a neural net that is under-fitted; if w is too big it may be
over-fitted. The criteria for choosing an appropriate dimensionality ratio n/w are not
guided by exact formulas as in statistical classification, but rather by some intricate
combinatorial considerations about the number of partitions achieved by a neural
net, the main results of which will be presented in section 5.6.
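For a fully connected MLP with one hidden layer and bias terms, w is
straightforward to count, as in this short Python sketch (the training set size n
below is a hypothetical value, merely to show the ratio):

    def mlp_weights(n_in, n_hidden, n_out):
        # Each hidden unit has n_in input weights plus a bias; each output
        # unit has n_hidden input weights plus a bias.
        return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

    w = mlp_weights(7, 5, 3)   # 58 weights, the MLP7:5:3 used below
    n = 75                     # hypothetical number of training patterns
    print(n / w)               # dimensionality ratio n/w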
It is common practice, in order to avoid over-fitting (except for trivial
problems), to reserve a part of the available data for independent verification or
validation purposes during training. At each epoch, the current neural net solution
is applied to this set and the corresponding error curve is inspected. When the
validation set error starts to increase, it is assumed that some over-fitting is
present and the training is stopped (a procedure known as early stopping).
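A minimal sketch of this stopping rule, assuming hypothetical train_epoch and
validation_error helpers and using a simple patience counter:

    def train_with_early_stopping(model, train_set, val_set,
                                  max_epochs=2000, patience=50):
        # Stop when the validation error has not improved for `patience`
        # epochs, and keep the weights with the lowest validation error.
        best_error = float("inf")
        best_weights = model.weights.copy()
        stale_epochs = 0
        for epoch in range(max_epochs):
            train_epoch(model, train_set)          # one pass over the training set
            error = validation_error(model, val_set)
            if error < best_error:
                best_error = error
                best_weights = model.weights.copy()
                stale_epochs = 0
            else:
                stale_epochs += 1
                if stale_epochs >= patience:
                    break                          # validation error is degrading
        model.weights = best_weights
        return model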
Let us illustrate this aspect with the cork stoppers classification problem for
three classes. We divide each available dataset per class into approximately one
half of the cases for training, a quarter for validation and another quarter for
testing. Next we apply the back-propagation algorithm to an MLP7:5:3 (58
weights), having 7 features (the 10-feature set except NG, PRTG, RAAR) as
inputs. The neural network has three nominal outputs corresponding to the three
classes. Figure 5.26 shows how the validation error starts to degrade after a
certain number of epochs (around 500). This means that beyond this point the neural net is
over-fitting the training data and losing the capacity to generalize to other
independent sets. We should therefore stop the training around 500 epochs.
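The per-class split described above can be written along these lines (a sketch;
the per-class arrays and the random seed are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def split_class(X_class):
        # Split one class's cases into about 1/2 training, 1/4 validation
        # and 1/4 test, after shuffling.
        n = X_class.shape[0]
        idx = rng.permutation(n)
        n_train, n_val = n // 2, n // 4
        return (X_class[idx[:n_train]],
                X_class[idx[n_train:n_train + n_val]],
                X_class[idx[n_train + n_val:]])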