5.6 Performance of Neural Networks
If the network is complex enough to perfectly fit a given training set, the bias
will be zero for that training set at the points xi, and will typically also be low in
the neighbourhood of those points for other training sets; however, the network
will show a significant variance, directly related to the variance of the noise term
e(xi). If, at the other extreme, we design a network implementing a very simple
function, which only reproduces the main trend of the target values, then we will
obtain a high bias, since the implemented function departs significantly from the
target values, but a very low variance, since it is insensitive to the noise term e(xi).
In general, we will have to make a compromise between a complex model with a
good fit but poor generalization, and a very simple model with good
generalization but a significant departure from the desired output. In order to
decrease the bias we have to implement more complex models, but then, in order
to keep the variance low, we have to train the model on larger datasets. For a
small training set a simpler model usually performs better, as the sketch below
illustrates.
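The following sketch, in Python with scikit-learn (the data, network sizes and noise level are illustrative assumptions, not taken from the text), fits a very simple and a complex network to many noisy training sets drawn from the same target function and estimates the squared bias and the variance of their predictions; the simple network shows high bias and low variance, the complex one the opposite.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Illustrative setup: target t(x) = sin(2*pi*x) plus noise e(x).
    rng = np.random.default_rng(0)
    x_test = np.linspace(0, 1, 50).reshape(-1, 1)
    t_test = np.sin(2 * np.pi * x_test).ravel()        # noise-free target values

    def fit_many(n_hidden, n_sets=30, n_points=40, noise=0.3):
        preds = []
        for s in range(n_sets):                         # one fit per training set
            x = rng.uniform(0, 1, (n_points, 1))
            y = np.sin(2 * np.pi * x).ravel() + noise * rng.standard_normal(n_points)
            net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=3000,
                               random_state=s).fit(x, y)
            preds.append(net.predict(x_test))
        return np.array(preds)                          # shape (n_sets, n_test_points)

    for n_hidden in (2, 50):                            # very simple vs complex network
        p = fit_many(n_hidden)
        bias2 = np.mean((p.mean(axis=0) - t_test) ** 2) # squared bias at the test points
        var = np.mean(p.var(axis=0))                    # variance across training sets
        print(f"{n_hidden:2d} hidden units: bias^2 = {bias2:.3f}, variance = {var:.3f}")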
The choice of the appropriate model complexity, related to the number of
weights, will be discussed in the next section. In practice, there are several
techniques that may help to tune the model adequately to the training data,
namely:
- Early stopping. Use an independent validation set during training and stop the
training process when the error on the validation set starts to increase. This
technique was already mentioned in section 5.5.2; a minimal sketch is given
after this list.
- Regularization. Select a model based not only on its performance but also on its
complexity, penalizing models that are highly complex. This regularization
technique is applied by STATISTICA in the Intelligent Problem Solver. Another
regularization approach is the inclusion in the error formula (5-2a) of an extra
term penalizing large weights (weight regularization). As a matter of fact, when
using sigmoidal activation functions, small weights correspond to the "linear"
central part of the functions. Large weights, on the contrary, correspond to highly
non-linear behaviour, producing high-curvature surfaces that fit the data
perfectly. By penalizing the larger weights, the network tends to develop
smoother surfaces without over-fitting the data (see Weigend et al., 1991); see
also the sketch after this list.
- Training with noise. By adding a small amount of noise to the input values
during training, we are in effect forcing the network to learn from several
datasets, thereby decreasing the variance term of the error. The network
will improve its generalization capability, as already mentioned in section 5.5.2
(see the sketch after this list).
- Network pruning. After several training experiments, inspect the weights and
remove those that are very small, and therefore contribute little to the output,
and retrain the pruned, simpler model. It is also possible to analyse how the
error deteriorates when each input variable is removed (sensitivity analysis), as
sketched after this list. If the error before and after removing a variable is
practically the same, the removed variable makes no significant contribution and
the respective input can be pruned.
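The sketches below, again in Python with scikit-learn and with assumed data and parameter values, illustrate these techniques. First, early stopping with an independent validation set: the weights are updated on the training set only, and training stops once the validation error has kept increasing for a number of consecutive epochs (the "patience" value is an assumption).

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    # Illustrative data: 4 inputs, noisy target depending mainly on the first input.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (300, 4))
    y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(300)
    X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

    net = MLPRegressor(hidden_layer_sizes=(20,), solver="adam", random_state=0)
    best_val, stalls, patience = np.inf, 0, 10
    for epoch in range(1000):
        net.partial_fit(X_train, y_train)               # one pass over the training set
        val_err = mean_squared_error(y_val, net.predict(X_val))
        if val_err < best_val:
            best_val, stalls = val_err, 0               # validation error still decreasing
        else:
            stalls += 1                                 # validation error increasing
        if stalls >= patience:                          # stop once it keeps increasing
            print(f"stopped at epoch {epoch}, best validation MSE {best_val:.4f}")
            break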
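Second, weight regularization: scikit-learn's alpha parameter adds an L2 penalty on the weights to the squared-error cost, playing the role of the extra penalty term mentioned for formula (5-2a). The arrays from the early-stopping sketch are reused; the alpha values are assumptions.

    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    # A larger alpha keeps the weights small, i.e. in the quasi-linear central
    # part of the sigmoids, and therefore yields smoother mapping surfaces.
    for alpha in (1e-6, 1e-2):                          # weak vs strong weight penalty
        net = MLPRegressor(hidden_layer_sizes=(20,), alpha=alpha,
                           max_iter=3000, random_state=0).fit(X_train, y_train)
        val_err = mean_squared_error(y_val, net.predict(X_val))
        print(f"alpha = {alpha:g}: validation MSE = {val_err:.4f}")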
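Third, training with noise: a small amount of Gaussian noise (the level 0.05 is an assumption) is added to the inputs on every pass, so the network effectively trains on a slightly different dataset each epoch; X_train and y_train are again those of the early-stopping sketch.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(1)
    noisy_net = MLPRegressor(hidden_layer_sizes=(20,), solver="adam", random_state=0)
    for epoch in range(300):
        # Fresh noise is drawn at each epoch, jittering the original inputs.
        X_jittered = X_train + 0.05 * rng.standard_normal(X_train.shape)
        noisy_net.partial_fit(X_jittered, y_train)      # one pass on the jittered inputs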
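Finally, a simple form of sensitivity analysis for network pruning (data and names are again illustrative assumptions): the validation error is recomputed with each input variable in turn replaced by its training-set mean, i.e. carrying no information. Inputs whose removal leaves the error practically unchanged are candidates for pruning.

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.metrics import mean_squared_error

    # Illustrative data: the target depends only on the first two inputs.
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (300, 4))
    y = np.sin(3 * X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(300)
    X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

    net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=3000,
                       random_state=0).fit(X_train, y_train)
    baseline = mean_squared_error(y_val, net.predict(X_val))

    for j in range(X.shape[1]):
        X_masked = X_val.copy()
        X_masked[:, j] = X_train[:, j].mean()           # "remove" input variable j
        err = mean_squared_error(y_val, net.predict(X_masked))
        print(f"input {j}: validation MSE {err:.4f} (baseline {baseline:.4f})")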