
5.5 Multi-Layer Perceptrons   177


decreasing learning rate, finishing with a small value after a large number of epochs.
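A decreasing learning-rate schedule of this kind can be sketched as follows; the exponential decay form, the initial rate and the decay constant are illustrative assumptions, not values prescribed by the text:

```python
import math

def learning_rate(epoch, eta0=0.5, decay=1e-3):
    """Exponentially decaying learning rate: starts at eta0 and
    approaches a small value after many epochs (illustrative values)."""
    return eta0 * math.exp(-decay * epoch)

# Large at the start of training, very small after many epochs.
print(learning_rate(0))        # 0.5
print(learning_rate(20000))    # a very small value
```

Other decreasing schedules (e.g. 1/epoch decay) serve the same purpose; the essential point is that the step size ends up small after many epochs.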

Figure 5.25. Learning curve for the dataset represented in Figure 5.23. More than 20000 epochs are needed for a definite convergence path.



  The momentum factor is chosen in the range [0, 1[, and it is advisable to decrease it during training.
  Note that when a class is represented by very few patterns, training may take a long time before the optimal solution is reached. This is a consequence of the fact that the error energy is then only weakly influenced by the errors relative to the poorly represented class. This effect is exemplified by the set 3 training of the MLP Sets data (Figure 5.23c), as illustrated in Figure 5.25. Convergence to the global minimum, using an MLP2:4:2:1, was observed in one trial only, after more than 20000 iterations. For some initial values of the learning parameters no convergence was observed at all. In such difficult cases it is advisable to use small training factors; in the case of Figure 5.25 a value of 0.02 was used.
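The weak influence of a poorly represented class on the error energy can be seen directly: in a batch mean-squared-error criterion, each class contributes in proportion to its number of patterns. A minimal sketch, with hypothetical class sizes:

```python
# Hypothetical imbalanced training set: 950 vs. 50 patterns.
n_major, n_minor = 950, 50
e = 0.5   # same per-pattern squared error assumed for every pattern

# Share of the total error energy due to the minority class.
share = (n_minor * e) / ((n_major + n_minor) * e)
print(share)  # 0.05 -> the minority class drives only 5% of the energy
```

Since the gradient is dominated by the majority-class terms, the weights move mostly to reduce the majority-class error, and the minority class is fitted only slowly.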

                             Local minima

In order to avoid local minima of the energy function, one can run several training experiments with different specifications for the initial weights and the learning and momentum factors. The number of experiments, r, with different random starting weights, needed to ensure that a network will reach a solution within a desirable lower percentile of all possible experiments, is given by (Iyer and Rhinehart, 1999):

    r = ln(1 - a) / ln(1 - p),

                              where p is the percentile and a is the confidence level.
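As a numerical check of the formula, the number of random restarts needed to reach, with 99% confidence, a solution within the best 5% of all possible experiments (p = 0.05, a = 0.99; values chosen for illustration) can be computed as:

```python
import math

def n_restarts(p, a):
    """Smallest integer r such that 1 - (1 - p)**r >= a,
    i.e. r = ln(1 - a) / ln(1 - p) rounded up."""
    return math.ceil(math.log(1 - a) / math.log(1 - p))

print(n_restarts(0.05, 0.99))  # 90
```

The derivation is the usual one: a single trial lands in the best p-fraction with probability p, so all r trials miss it with probability (1 - p)^r; requiring this to be at most 1 - a and solving for r gives the formula.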