which shows that the learning process in gradient descent depends on the size of
the eigenvalues of the Hessian matrix. We can make this more explicit by writing
this last equation as:

Δa_i = -η λ_i a_i ,   so that   a_i(new) = (1 - η λ_i) a_i(old).
We conclude that the updating of the weights is directly related to the updating
of the distances a_i, and these depend on the (1 - η λ_i) factors. The distances will
decrease only if |1 - η λ_i| < 1; steadily if 1 - η λ_i is positive, with oscillations if it is
negative. In order for the condition |1 - η λ_i| < 1 to be satisfied one must have:

η < 2/λ_max     (5-42)

where λ_max is the largest eigenvalue.
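As a quick numerical check of this analysis, the following Python sketch (illustrative only, not part of the book's material; the function name and the values used are arbitrary) iterates a single distance a_i under the rule a_i ← (1 - η λ_i) a_i and reports which of the three regimes occurs.

def eigendirection_update(eta, lam, a0=1.0, steps=20):
    # One eigendirection of the quadratic error: a_i <- (1 - eta*lambda_i) * a_i.
    factor = 1.0 - eta * lam
    a = a0
    for _ in range(steps):
        a = factor * a
    if abs(factor) >= 1.0:
        regime = "no convergence (eta >= 2/lambda)"
    elif factor > 0:
        regime = "steady decrease"
    else:
        regime = "decrease with oscillations"
    return a, regime

# Eigenvalue lambda = 4, matching the two-weight example discussed below,
# so the bound (5-42) gives eta < 0.5:
for eta in (0.15, 0.45, 0.55):
    a, regime = eigendirection_update(eta, lam=4.0)
    print(f"eta = {eta:.2f}: a after 20 steps = {a: .3e}  ({regime})")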
On the other hand, the speed of convergence is dominated by the smallest
eigenvalue. For the maximum η allowed by formula (5-42), the convergence speed
along the eigenvector direction corresponding to the smallest eigenvalue is
governed by the factor (1 - 2λ_min/λ_max).
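As an illustration (with values chosen arbitrarily, not taken from the book's example), if λ_max = 4 and λ_min = 0.5, the largest admissible rate is η = 2/4 = 0.5, and the slowest eigendirection then contracts only by the factor 1 - 2(0.5)/4 = 0.75 per iteration; a wide spread of Hessian eigenvalues therefore forces slow convergence even at the best learning rate.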
We have already seen in section 5.1 how the learning rate influences the
gradient descent process. Let us illustrate this dependence on the eigenvalues with
the example of the error function (5-4c), with two weights denoted a and b for
simplicity:
The minimum of E occurs at [1, 0]'. From Exercise 5.9 it is possible to conclude
that λ_max = 4, therefore η < 0.5 for convergence.
Figure 5.30 shows the horizontal projection of the parabolic surface
corresponding to E, with the progress of the gradient descent. If a low learning rate
of η = 0.15 is used, the convergence is slow. For η = 0.45, near the 0.5 limit, one
starts getting oscillations around the minimum, along the vertical line. The reader
can use the Error Energy.xls file to experiment with other values of η and verify
the occurrence of divergent oscillations for η > 0.5.
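For readers without the Error Energy.xls file, the following Python sketch reproduces the same experiment numerically. Since the expression of (5-4c) is not repeated on this page, a representative quadratic with minimum at [1, 0]' and largest Hessian eigenvalue 4 is assumed, E(a, b) = (a - 1)^2 + 2b^2; the constants are therefore illustrative, and only the qualitative behaviour of the three learning rates matters.

import numpy as np

def gradient_descent(eta, start=(0.0, 1.0), iterations=30):
    # Plain gradient descent on the assumed surface E(a, b) = (a - 1)**2 + 2*b**2,
    # whose gradient is (2*(a - 1), 4*b).
    w = np.array(start, dtype=float)
    for _ in range(iterations):
        grad = np.array([2.0 * (w[0] - 1.0), 4.0 * w[1]])
        w = w - eta * grad
    return w

for eta in (0.15, 0.45, 0.55):
    a, b = gradient_descent(eta)
    print(f"eta = {eta:.2f}: final (a, b) = ({a: .4f}, {b: .4e})")
# eta = 0.15 converges slowly, eta = 0.45 oscillates along b before settling,
# and eta = 0.55 (above the 2/lambda_max = 0.5 limit) diverges along b.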
The problem of the oscillations is partly solved by using the momentum factor
of equations (5-25a) and (5-25b).
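A minimal sketch of that remedy, assuming the usual momentum update Δw(t) = α Δw(t-1) - η ∇E (the exact form of (5-25a) and (5-25b) is given earlier in the chapter and may differ in detail), applied to the same assumed quadratic as above:

import numpy as np

def gradient_descent_momentum(eta, alpha, start=(0.0, 1.0), iterations=30):
    # Assumed momentum rule: delta_w(t) = alpha*delta_w(t-1) - eta*grad(E),
    # on the assumed surface E(a, b) = (a - 1)**2 + 2*b**2.
    w = np.array(start, dtype=float)
    delta = np.zeros(2)
    for _ in range(iterations):
        grad = np.array([2.0 * (w[0] - 1.0), 4.0 * w[1]])
        delta = alpha * delta - eta * grad
        w = w + delta
    return w

print(gradient_descent_momentum(eta=0.45, alpha=0.0))  # plain descent: b decays slowly, alternating sign
print(gradient_descent_momentum(eta=0.45, alpha=0.3))  # with momentum: the zig-zag along b dies out much faster

For this assumed quadratic, α = 0.3 shrinks the amplitude of the oscillation along the stiff direction by roughly √0.3 ≈ 0.55 per step instead of 0.8, so the zig-zagging is damped rather than eliminated.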