5.1 LMS Adjusted Discriminants
where $Y$ is the matrix of the transformed features $y_i = f(x_i)$.
Using generalized decision functions, one can obtain arbitrarily complex
decision surfaces at the possible expense of having to work in much higher
dimensional spaces, as already pointed out in 2.1.1. Another difficulty is that, in
practical applications, when there are several features, it may be quite impossible
to figure out the most appropriate transforming functions $f(x_i)$. Also note that
nothing is gained by cascading an arbitrary number of linear discriminant units
(Figure 5.1), because a linear composition of linear discriminants is itself a linear
discriminant. In order to achieve more complex decision surfaces, what we really
need is to apply non-linear processing units, as will be done in the following
section.
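To see why cascading gains nothing, write out two cascaded linear units (a sketch in generic notation; $W_1$ and $\mathbf{w}_2$ are illustrative weight parameters, not quantities defined in the text):

$$ d(\mathbf{x}) = \mathbf{w}_2'\,(W_1\mathbf{x}) = (W_1'\mathbf{w}_2)'\,\mathbf{x} = \tilde{\mathbf{w}}'\mathbf{x}, $$

so the cascade collapses to a single linear discriminant with weight vector $\tilde{\mathbf{w}} = W_1'\mathbf{w}_2$.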
A limitation of the method of minimum energy adjustment is that, in practice,
the solution of the normal equations may be difficult or even impossible to obtain,
due to $X'X$ being singular or nearly singular. One can, however, circumvent this
limitation by using a gradient descent method, provided that E is a differentiable
function of the weights, which holds when using the error energy (5-2a).
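As a minimal illustration of the singularity problem (hypothetical data, not code from the text), a duplicated feature column makes $X'X$ singular, so solving the normal equations directly fails, while a least-squares routine based on the pseudo-inverse still returns a minimum-norm solution:

```python
import numpy as np

# Hypothetical data: the second feature duplicates the first,
# so X'X is singular and the normal equations have no unique solution.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
t = np.array([1.0, 2.0, 3.0])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))    # 1, i.e. X'X is singular

# A direct solve fails:
# np.linalg.solve(XtX, X.T @ t)      # raises LinAlgError: Singular matrix

# The pseudo-inverse (minimum-norm least squares) still yields weights:
w, *_ = np.linalg.lstsq(X, t, rcond=None)
print(w)                             # [0.5, 0.5]
```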
In order to apply the gradient descent method we begin with an initial guess of
the weight values (e.g. a random choice), and from there on we iteratively update
the weights in order to decrease the energy. The energy decreases fastest along
the negative of the gradient, so we update the weights at iteration $r+1$ by adding
a small amount of the negative of the gradient computed at iteration $r$:

$$ \mathbf{w}(r+1) = \mathbf{w}(r) - \eta\,\nabla E\big(\mathbf{w}(r)\big) \tag{5-7} $$
The factor $\eta$, a small positive constant controlling how fast we move along the
negative of the gradient, is called the learning rate.
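In code, the update (5-7) might look like the following sketch (assuming the usual quadratic error energy of a linear discriminant; the function and variable names are illustrative):

```python
import numpy as np

def gradient_descent(X, t, eta=0.01, iterations=1000):
    """Batch gradient descent for a linear discriminant w'x, assuming
    the quadratic error energy E = 0.5 * sum((t - X w)**2)."""
    rng = np.random.default_rng(0)
    w = 0.01 * rng.standard_normal(X.shape[1])  # initial guess: random choice
    for _ in range(iterations):
        grad = -X.T @ (t - X @ w)   # gradient of E with respect to w
        w -= eta * grad             # update rule (5-7)
    return w
```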
Consider the energy surface represented in Figure 5.2. Starting at any point at
the top of the surface, we move in the direction of steepest descent. The choice of
the learning rate is critical: if it is too small we converge slowly to the minimum,
and if it is too large we get oscillations around the minimum.
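This trade-off can be seen on the toy one-dimensional energy $E(w) = w^2$ (an illustration, not an example from the text); the gradient is $2w$, so each update multiplies $w$ by $(1 - 2\eta)$:

```python
def descend(eta, w=1.0, steps=5):
    """Gradient descent on E(w) = w**2, whose gradient is 2*w."""
    trace = [w]
    for _ in range(steps):
        w = w - eta * 2 * w
        trace.append(w)
    return trace

print(descend(0.1))   # slow, monotone convergence towards 0
print(descend(0.9))   # oscillates around 0 but still converges
print(descend(1.1))   # oscillates with growing amplitude (diverges)
```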
The weight updating expressed by equation (5-7) can be performed, one pattern
at a time, by computing the derivative of the energy $E_i$ for the current pattern $\mathbf{x}_i$:

$$ \mathbf{w}(r+1) = \mathbf{w}(r) - \eta\,\frac{\partial E_i}{\partial \mathbf{w}} \tag{5-7a} $$
The process of weight adjustment is then repeated many times by cycling
through all patterns.
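A sketch of this pattern-by-pattern adjustment, i.e. the update (5-7a) applied while cycling through all patterns (this is essentially the Widrow-Hoff/LMS rule; the variable names and number of cycles are illustrative):

```python
import numpy as np

def lms_online(X, t, eta=0.01, epochs=50):
    """Per-pattern weight adjustment: update w after each pattern x_i,
    then repeat by cycling many times through the whole pattern set."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, t_i in zip(X, t):
            error = t_i - w @ x_i    # deviation of the output from the target
            w += eta * error * x_i   # update (5-7a) for the current pattern
    return w
```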
Let us compute the gradient in (5-7a). The energy contribution of each pattern $\mathbf{x}_i$
is computed using equation (5-1a):