

For instance, if we want to ensure with confidence a = 99% that we will reach a solution among the best p = 20%, we need to perform r = ln(1-0.99)/ln(1-0.2) ≈ 20.6, i.e., 21 experiments.
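
As a quick check of this rule, here is a minimal Python sketch; the function name required_restarts is ours (not from the text), and rounding up simply guarantees the stated confidence level:

    import math

    def required_restarts(a, p):
        # Smallest r such that the probability of missing the best fraction p
        # in all r independent trials, (1 - p)**r, does not exceed 1 - a.
        return math.ceil(math.log(1.0 - a) / math.log(1.0 - p))

    print(required_restarts(a=0.99, p=0.20))   # prints 21
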
Besides repeating experiments, there are other techniques that can help achieve good results (a code sketch of all three follows the list):

- Case shuffling: by shuffling the cases in each epoch, we hope that the network tries alternative descent routes and avoids getting stuck in a path leading to a local minimum.
- Adding noise: by adding a small amount of random noise to the inputs in each epoch, we hope to achieve an effect similar to case shuffling, namely the exploration of alternative descent routes. Noise can also provide better generalization of the neural net, preventing it from over-fitting the training set.
- Jogging weights: by adding a small random quantity to the weights, it may be possible to escape a local minimum.
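
The three techniques can be combined in an ordinary gradient-descent training loop. The sketch below is only illustrative: the function names, the grad_fn interface and the noise and jogging levels are our assumptions, not part of the original text.

    import numpy as np

    rng = np.random.default_rng(0)

    def train_epochwise(X, y, weights, grad_fn, step=0.1, n_epochs=100,
                        noise_level=0.01, jog_level=0.001):
        # grad_fn(X, y, w) is assumed to return the gradient of the training
        # error with respect to the weight vector w.
        for _ in range(n_epochs):
            order = rng.permutation(len(X))                           # case shuffling
            Xe = X[order] + rng.normal(0.0, noise_level, X.shape)     # adding noise
            ye = y[order]
            weights = weights + rng.normal(0.0, jog_level, weights.shape)  # jogging weights
            weights = weights - step * grad_fn(Xe, ye, weights)       # descent step
        return weights

    # Toy usage on a linear least-squares problem (purely illustrative):
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5])
    grad = lambda Xb, yb, w: 2 * Xb.T @ (Xb @ w - yb) / len(Xb)
    w = train_epochwise(X, y, np.zeros(3), grad)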

Dimensionality ratio and generalization

If we train a network with an arbitrarily complex architecture we may also obtain arbitrarily low errors for the training data, since we are attempting to model exactly the structure of the training data with the neural network. As we have already seen in previous chapters, the real issue is to obtain a solution that performs equally well (on average) in independent test sets. Once more we are confronted with a dimensionality ratio issue, here in the form of n/w, n being the total number of patterns in the training set and w the total number of weights to adjust. If w is too small we may obtain a neural net that is under-fitted; if w is too big it may be over-fitted. The criteria for choosing an appropriate dimensionality ratio n/w are not guided by exact formulas as in statistical classification, but rather by some intricate combinatorial considerations about the number of partitions achievable by a neural net, the main results of which will be presented in section 5.6.
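
For a fully connected MLP with bias terms, w is easily counted from the layer sizes. The small sketch below uses this convention (which matches the 58 weights quoted for the MLP7:5:3 further on); the value of n is chosen purely for illustration:

    def mlp_weight_count(layer_sizes):
        # Each layer contributes (inputs + 1 bias) * outputs weights.
        return sum((n_in + 1) * n_out
                   for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

    w = mlp_weight_count([7, 5, 3])   # (7+1)*5 + (5+1)*3 = 58
    n = 75                            # illustrative number of training patterns
    print(w, n / w)                   # dimensionality ratio n/w
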
It is common practice, in order to avoid over-fitting (except for trivial problems), to reserve a part of the available data for independent verification or validation purposes during training. At each epoch, the current neural net solution is applied to this set and the corresponding error curve is inspected. When degradation of the validation set error is detected, it is assumed that some over-fitting is present and the training is stopped.
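
This stopping rule can be sketched as follows; fit_one_epoch, val_error and the patience parameter are our assumptions about a generic training interface, not part of the text:

    import numpy as np

    def train_with_early_stopping(fit_one_epoch, val_error,
                                  max_epochs=5000, patience=50):
        # fit_one_epoch() performs one training epoch; val_error() returns the
        # current error on the reserved validation set. Training stops once the
        # validation error has not improved for `patience` consecutive epochs.
        best_err, best_epoch = np.inf, 0
        for epoch in range(1, max_epochs + 1):
            fit_one_epoch()
            err = val_error()
            if err < best_err:
                best_err, best_epoch = err, epoch
            elif epoch - best_epoch >= patience:
                break   # validation error degrading: assume over-fitting
        return best_epoch, best_err
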
Let us illustrate this aspect with the cork stoppers classification problem for three classes. We divide the available data of each class into approximately one half of the cases for training, a quarter for validation and another quarter for testing. Next we apply the back-propagation algorithm to an MLP7:5:3 (58 weights), having 7 features (the 10-feature set except NG, PRTG, RAAR) as inputs. The neural network has three nominal outputs corresponding to the three classes. Figure 5.26 shows how the validation error starts degrading after a certain number of epochs, around 500. This means that beyond this point the neural net is over-fitting the training data and losing the capacity to generalize to other independent sets. We should therefore stop the training at around 500 epochs.
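
The per-class half/quarter/quarter split described above could be implemented along these lines; the cork-stoppers data itself is not reproduced here, and split_per_class and all variable names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)

    def split_per_class(X, y, fractions=(0.5, 0.25, 0.25)):
        # Splits the cases of each class into training, validation and test
        # subsets in roughly the given proportions.
        parts = ([], [], [])
        for c in np.unique(y):
            idx = rng.permutation(np.flatnonzero(y == c))
            n_tr = int(round(fractions[0] * len(idx)))
            n_va = int(round(fractions[1] * len(idx)))
            parts[0].extend(idx[:n_tr])
            parts[1].extend(idx[n_tr:n_tr + n_va])
            parts[2].extend(idx[n_tr + n_va:])
        return [(X[p], y[p]) for p in parts]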