
5.1 LMS Adjusted Discriminants
\[
  w^{*} = (Y'Y)^{-1}\, Y'\, t ,
\]

where Y is the matrix of the transformed features y_i = f(x_i).
Using generalized decision functions, one can obtain arbitrarily complex decision surfaces, at the possible expense of having to work in much higher dimensional spaces, as already pointed out in section 2.1.1. Another difficulty is that, in practical applications, when there are several features, it may be quite impossible to figure out the most appropriate transforming functions f(x_i). Also note that nothing is gained by cascading an arbitrary number of linear discriminant units (Figure 5.1), because a linear composition of linear discriminants is itself a linear discriminant. In order to achieve more complex decision surfaces, what we really need is to apply non-linear processing units, as will be done in the following section.
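To see concretely why the cascade collapses, consider two cascaded linear stages with weight matrix W_1 and weight vector w_2 (symbols introduced only for this sketch; bias terms omitted). The composed decision function is

\[
  d(x) = w_2'(W_1 x) = (W_1' w_2)'\, x = w' x , \qquad w = W_1' w_2 ,
\]

which is again a linear discriminant, now with weight vector w.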
A limitation of the method of minimum energy adjustment is that, in practice, the solution of the normal equations may be difficult or even impossible to obtain, due to X'X being singular or nearly singular. One can, however, circumvent this limitation by using a gradient descent method, provided that E is a differentiable function of the weights, as is the case when using the error energy (5-2a).
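As a small illustration of this difficulty, the following sketch (hypothetical data; numpy assumed) builds a design matrix with a bias column and two identical feature columns, so that X'X is exactly singular and the normal equations cannot be solved directly:

    import numpy as np

    # Bias column plus two identical feature columns: X'X is singular.
    X = np.array([[1.0, 2.0, 2.0],
                  [1.0, 3.0, 3.0],
                  [1.0, 5.0, 5.0],
                  [1.0, 6.0, 6.0]])
    t = np.array([-1.0, -1.0, 1.0, 1.0])        # target values

    XtX = X.T @ X
    print(np.linalg.matrix_rank(XtX))           # prints 2 (< 3): X'X is singular
    try:
        w = np.linalg.solve(XtX, X.T @ t)       # normal equations (X'X) w = X' t
    except np.linalg.LinAlgError as err:
        print("normal equations failed:", err)

Gradient descent, described next, avoids forming or inverting X'X altogether.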
In order to apply the gradient descent method we begin with an initial guess of the weight values (e.g. a random choice), and from there on we iteratively update the weights in order to decrease the energy. The maximum decrease of the energy is in the direction of the negative of the gradient; therefore, we update the weights at iteration r+1 by adding a small amount of the negative of the gradient computed at iteration r:
\[
  w^{(r+1)} = w^{(r)} - \eta\, \nabla E\big(w^{(r)}\big) . \tag{5-7}
\]
The factor η, a small positive constant controlling how fast we move along the negative of the gradient, is called the learning rate.
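A minimal sketch of the update (5-7) in Python, assuming the error energy (5-2a) is the sum of squared deviations E = ½ Σ_i (t_i - w'x_i)², whose gradient is -X'(t - Xw); the names gradient_descent, eta and n_iter are illustrative, not from the text:

    import numpy as np

    def gradient_descent(X, t, eta=0.01, n_iter=1000):
        """Iterate the weight update w <- w - eta * grad E(w), as in (5-7)."""
        rng = np.random.default_rng(0)
        w = rng.normal(scale=0.1, size=X.shape[1])   # initial random guess of the weights
        for _ in range(n_iter):
            grad = -X.T @ (t - X @ w)                # gradient of E = 0.5 * sum((t - X w)**2)
            w = w - eta * grad                       # step along the negative gradient
        return w

For a sufficiently small eta this converges towards a minimizer of E, even when X'X is singular, although possibly only after many iterations.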
Consider the energy surface represented in Figure 5.2. Starting at any point at the top of the surface, we move in the direction of the steepest descent of the surface. The choice of the learning rate is critical: if it is too small we will converge slowly to the minimum, and if it is too big we may get oscillations around the minimum.
The weight updating expressed by equation (5-7) can be performed, one pattern at a time, by computing the derivative of the energy E_i for the current pattern x_i:
\[
  w^{(r+1)} = w^{(r)} - \eta\, \nabla E_i\big(w^{(r)}\big) . \tag{5-7a}
\]
The process of weight adjustment is then repeated many times by cycling through all patterns.
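Under the same assumptions, a sketch of this pattern-by-pattern cycling (one pass through all patterns per epoch; sequential_lms and n_epochs are again illustrative names):

    import numpy as np

    def sequential_lms(X, t, eta=0.01, n_epochs=100):
        """Apply the per-pattern update (5-7a) to one pattern at a time, cycling through all patterns."""
        rng = np.random.default_rng(0)
        w = rng.normal(scale=0.1, size=X.shape[1])
        for _ in range(n_epochs):
            for x_i, t_i in zip(X, t):
                grad_i = -(t_i - w @ x_i) * x_i      # gradient of E_i = 0.5 * (t_i - w'x_i)**2
                w = w - eta * grad_i
        return w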
Let us compute the gradient in (5-7a). The energy contribution of each pattern x_i is computed using equation (5-1a):