which shows that the learning process in gradient descent depends on the size of the eigenvalues of the Hessian matrix. We can make this more explicit by writing this last equation as:

$$a_i(k+1) = (1 - \eta\lambda_i)\, a_i(k),$$

where $a_i(k)$ is the distance to the minimum, at iteration $k$, along the direction of the eigenvector with eigenvalue $\lambda_i$.
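As a brief justification of this form (a sketch; the derivation on the preceding page is not reproduced here), near the minimum $\mathbf{w}^*$ the error is approximately quadratic, and the gradient descent step decouples along the eigenvectors $\mathbf{u}_i$ of the Hessian $\mathbf{H}$:

$$E(\mathbf{w}) \approx E(\mathbf{w}^*) + \tfrac{1}{2}(\mathbf{w} - \mathbf{w}^*)^T \mathbf{H}\,(\mathbf{w} - \mathbf{w}^*),$$
$$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta \nabla E \;\Rightarrow\; \mathbf{w}(k+1) - \mathbf{w}^* = (\mathbf{I} - \eta\mathbf{H})(\mathbf{w}(k) - \mathbf{w}^*),$$

so that each component $a_i$ of $\mathbf{w} - \mathbf{w}^*$ along $\mathbf{u}_i$ is simply multiplied by $(1 - \eta\lambda_i)$ at every step.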
   We conclude that the updating of the weights is directly related to the updating of the distances $a_i$, and these depend on the $(1 - \eta\lambda_i)$ factors. The distances will decrease only if $|1 - \eta\lambda_i| < 1$; steadily if $1 - \eta\lambda_i$ is positive, with oscillations if negative. In order for the condition $|1 - \eta\lambda_i| < 1$ to be satisfied one must have:

$$\eta < \frac{2}{\lambda_{max}}, \qquad (5\text{-}42)$$

where $\lambda_{max}$ is the largest eigenvalue.
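A minimal numerical check of this bound (a sketch, not from the book; the eigenvalues used here are arbitrary choices with $\lambda_{max} = 4$):

    import numpy as np

    # Each distance a_i is multiplied by (1 - eta*lambda_i) at every step;
    # convergence requires |1 - eta*lambda_i| < 1 for all i, i.e. eta < 2/lambda_max.
    lambdas = np.array([0.5, 4.0])           # assumed eigenvalues; lambda_max = 4
    for eta in (0.15, 0.45, 0.55):           # below, near and above 2/lambda_max = 0.5
        a = np.ones(2)                       # initial distances to the minimum
        for _ in range(50):
            a = (1.0 - eta * lambdas) * a    # update of each distance a_i
        print(f"eta = {eta}: |a| after 50 steps = {np.linalg.norm(a):.3g}")

With eta = 0.55 the factor along the largest eigenvalue is 1 - 0.55·4 = -1.2, and the distances oscillate with growing amplitude.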
   On the other hand, the speed of convergence is dominated by the smallest eigenvalue. For the maximum $\eta$ allowed by formula (5-42), the convergence speed along the eigenvector direction corresponding to the smallest eigenvalue is governed by $(1 - 2\lambda_{min}/\lambda_{max})$.
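For instance (an illustrative computation, not taken from the text), with $\lambda_{max} = 4$ and $\lambda_{min} = 0.4$, the maximum admissible rate $\eta \approx 2/4 = 0.5$ gives a factor $1 - 2\lambda_{min}/\lambda_{max} = 1 - 0.2 = 0.8$ along the slow direction: the corresponding distance shrinks by only 20% per step. The wider the eigenvalue spread, the closer this factor is to 1 and the slower the convergence.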
   We have already seen in section 5.1 how the learning rate influences the gradient descent process. Let us illustrate this dependence on the eigenvalues with the example of the error function (5-4c), with two weights denoted a and b for simplicity:






   The minimum of E occurs at $[1, 0]^T$. From Exercise 5.9 it is possible to conclude that $\lambda_{max} = 4$, therefore $\eta < 0.5$ for convergence.
   Figure 5.30 shows the horizontal projection of the parabolic surface corresponding to E, with the progress of the gradient descent. If a low learning rate of $\eta = 0.15$ is used, the convergence is slow. For $\eta = 0.45$, near the 0.5 limit, one starts getting oscillations around the minimum, along the vertical line. The reader can use the Error Energy.xls file to experiment with other values of $\eta$ and verify the occurrence of oscillations for $\eta > 0.5$.
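The Figure 5.30 experiment can also be sketched in a few lines of code. Since formula (5-4c) is not reproduced on this page, the quadratic $E = (a-1)^2 + 2b^2$ is assumed here as a stand-in: its minimum is at $[1, 0]^T$ and its Hessian has $\lambda_{max} = 4$, consistent with the facts stated above.

    import numpy as np

    # Assumed stand-in for (5-4c): E = (a - 1)^2 + 2*b^2,
    # minimum at [1, 0]^T, Hessian diag(2, 4), so lambda_max = 4 and eta < 0.5.
    def grad(w):
        a, b = w
        return np.array([2.0 * (a - 1.0), 4.0 * b])

    for eta in (0.15, 0.45, 0.55):
        w = np.array([0.0, 1.0])             # arbitrary starting point
        for _ in range(40):
            w = w - eta * grad(w)            # plain gradient descent step
        print(f"eta = {eta}: w after 40 steps = {np.round(w, 4)}")

With eta = 0.15 the descent is steady but slow; with eta = 0.45 the b component flips sign at every step (factor 1 - 0.45·4 = -0.8), reproducing the oscillations along the vertical line; with eta = 0.55 the iteration diverges.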
   The problem of the oscillations is partly solved by using the momentum factor of equations (5-25a) and (5-25b).