
5.6 Performance of Neural Networks


If the network is complex enough to perfectly fit a given training set, the bias will be zero for that training set at the points xi, and will typically also be low in the neighbourhood of those points for other training sets; however, the network will show a significant variance, directly related to the variance of the noise term e(xi). If, at the other extreme, we design a network implementing a very simple function, which only reproduces the main trend of the target values, then we will obtain a high bias, since the implemented function departs significantly from the target values, but a very low variance, since it is insensitive to the noise term e(xi).
In general, we have to make a compromise between a complex model with a good fit but poor generalization, and a very simple model with good generalization but a significant departure from the desired output. In order to decrease the bias we have to implement more complex models, but then, in order to decrease the variance, we have to train the model on larger datasets. For a small training set a simpler model usually performs better.
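The bias-variance behaviour described above can be estimated by simulation. The following sketch, where polynomial fits of low and high degree stand in for simple and complex networks, repeatedly draws noisy training sets from a fixed target function and measures the squared bias and the variance of the fitted values at the points xi; the target function, noise level and degrees are illustrative assumptions, not values from the text.

import numpy as np

rng = np.random.default_rng(0)
target = lambda x: np.sin(2 * np.pi * x)   # true regression function
x = np.linspace(0, 1, 25)                  # fixed input points x_i
n_sets, sigma = 200, 0.3                   # number of training sets, std of e(x_i)

for degree in (1, 12):                     # simple vs complex model
    preds = np.empty((n_sets, x.size))
    for s in range(n_sets):
        t = target(x) + rng.normal(0, sigma, x.size)   # noisy target values
        coef = np.polyfit(x, t, degree)
        preds[s] = np.polyval(coef, x)
    bias2 = np.mean((preds.mean(axis=0) - target(x)) ** 2)
    var = preds.var(axis=0).mean()
    print(f"degree {degree:2d}:  bias^2 = {bias2:.4f}   variance = {var:.4f}")

Running this shows the trade-off directly: the degree-1 model has high bias and low variance, while the degree-12 model fits each training set closely but varies strongly from one training set to the next.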
The choice of the appropriate model complexity, related to the number of weights, will be discussed in the next section. In practice, there are several techniques that may help to tune the model adequately to the training data, namely:

- Early stopping. Use an independent validation set during training and stop the training process when the error on the validation set starts to increase. This technique was already mentioned in 5.5.2 (a code sketch follows this list).
- Regularization. Select a model based not only on its performance but also on its complexity, penalizing models that are highly complex. This regularization technique is applied by STATISTICA in the Intelligent Problem Solver. Another regularization approach is the inclusion in the error formula (5-2a) of an extra term penalizing large weights (weight regularization; sketched after this list). As a matter of fact, when using sigmoidal activation functions, small weights correspond to the "linear" central part of the functions. Large weights, on the contrary, mean a highly non-linear behaviour, providing high-curvature surfaces with a perfect fit of the data. By penalizing the larger weights, the network tends to develop smoother surfaces without over-fitting the data (see Weigend et al., 1991).
- Training with noise. By adding a small amount of noise to the input values during training, we are actually forcing the network to learn from several datasets, and therefore decreasing the variance term of the error. The network will improve its generalization capability, as already mentioned in 5.5.2 (a sketch follows this list).
- Network pruning. After several training experiments, inspect the weights and remove those that are very small, and therefore contribute little to the output, and retrain the pruned, simpler model. It is also possible to analyse how the error deteriorates when each variable is removed (sensitivity analysis; sketched after this list). If the error before and after variable removal is practically the same, the removed variable has no significant contribution and one can then prune the respective input.
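A minimal early-stopping sketch, using scikit-learn's MLPRegressor (a library choice of ours, not the software referred to in the text): a fraction of the data is held out as a validation set, and training stops once the validation error has stopped improving for a given number of iterations. The data-generating function and all parameter values are illustrative.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (300, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 300)

net = MLPRegressor(hidden_layer_sizes=(30,),
                   early_stopping=True,      # hold out a validation set
                   validation_fraction=0.2,  # 20% of the data for validation
                   n_iter_no_change=20,      # patience before stopping
                   max_iter=5000, random_state=0)
net.fit(X, y)
print("stopped after", net.n_iter_, "iterations;",
      "best validation score:", round(net.best_validation_score_, 3))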
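Weight regularization can be sketched in the same setting: an extra term of the form lambda * sum(w^2) is added to the error of formula (5-2a), which in MLPRegressor corresponds to the alpha parameter (again our library choice, under the same assumed data). A stronger penalty drives the weights towards the "linear" central part of the sigmoids:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 200)

for alpha in (1e-5, 1e-1):                 # weak vs strong weight penalty
    net = MLPRegressor(hidden_layer_sizes=(50,), alpha=alpha,
                       max_iter=5000, random_state=0).fit(X, y)
    w = np.concatenate([c.ravel() for c in net.coefs_])
    print(f"alpha={alpha:g}:  mean |w| = {np.abs(w).mean():.3f}")

The larger alpha yields smaller weights on average, i.e. a smoother implemented function.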
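Training with noise amounts to jittering the inputs on every pass, so that the network effectively sees many slightly different training sets. A sketch under the same assumptions, using partial_fit to make one gradient pass per call:

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 200)

net = MLPRegressor(hidden_layer_sizes=(30,), random_state=0)
for epoch in range(500):
    X_noisy = X + rng.normal(0, 0.05, X.shape)  # jitter the inputs
    net.partial_fit(X_noisy, y)                 # one training pass per call
print("R^2 on the clean inputs:", round(net.score(X, y), 3))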
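Finally, a sensitivity-analysis sketch for input pruning: each input variable is in turn "removed" (here, replaced by its mean, one possible convention) and the resulting error is compared with the baseline. A variable whose removal barely changes the error is a pruning candidate; in the assumed data below, the third input is irrelevant by construction.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, (300, 3))
y = np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0, 0.1, 300)

net = MLPRegressor(hidden_layer_sizes=(30,), max_iter=5000,
                   random_state=0).fit(X, y)
baseline = np.mean((net.predict(X) - y) ** 2)
for j in range(X.shape[1]):
    X_j = X.copy()
    X_j[:, j] = X[:, j].mean()                # neutralize variable j
    mse = np.mean((net.predict(X_j) - y) ** 2)
    print(f"variable {j}: MSE {mse:.4f} (baseline {baseline:.4f})")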