
it has decreased or not. If it has decreased, then we accept this trial step and increase the step length; otherwise, we reject the trial step and decrease the step length. An alternative approach is to perform a line search for an optimal step length which provides the maximum possible reduction of the error function along the search direction, i.e.,

$$\alpha^{(k)} = \operatorname*{arg\,min}_{\alpha > 0} E\bigl(\bar{W}^{(k)} + \alpha p^{(k)}\bigr). \tag{2.32}$$

The GD method combined with this exact line search is called the steepest gradient descent.
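As an illustration, the following is a minimal sketch of the trial-step heuristic described above. The callables `E` and `grad_E` (the error function and its gradient) and the growth/shrink factors are assumptions made for the example, not values prescribed by the text.

```python
def gd_adaptive_step(E, grad_E, w, alpha=1e-2, n_iter=100, grow=1.2, shrink=0.5):
    """Gradient descent with a trial-step heuristic for the step length.

    E and grad_E are assumed callables returning the error function value and
    its gradient for a weight vector w; grow/shrink are illustrative factors.
    """
    e_curr = E(w)
    for _ in range(n_iter):
        p = -grad_E(w)              # steepest descent direction
        w_trial = w + alpha * p     # trial step
        e_trial = E(w_trial)
        if e_trial < e_curr:        # error decreased: accept step, increase length
            w, e_curr = w_trial, e_trial
            alpha *= grow
        else:                       # error did not decrease: reject step, decrease length
            alpha *= shrink
    return w
```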
Note that the global minimum of this univariate function is hard to find; in fact, even a search for an accurate estimate of a local minimum would require many iterations. Fortunately, we do not need to find an exact minimum along the specified direction: the convergence of the overall minimization procedure may be obtained if we guarantee a sufficient decrease of the error function at each iteration. If the search direction is a descent direction and if the step lengths satisfy the Wolfe conditions

$$\begin{aligned}
E\bigl(\bar{W}^{(k)} + \alpha^{(k)} p^{(k)}\bigr) &\leq E\bigl(\bar{W}^{(k)}\bigr) + c_1 \alpha^{(k)} \nabla E\bigl(\bar{W}^{(k)}\bigr)^{T} p^{(k)}, \\
\nabla E\bigl(\bar{W}^{(k)} + \alpha^{(k)} p^{(k)}\bigr)^{T} p^{(k)} &\geq c_2 \nabla E\bigl(\bar{W}^{(k)}\bigr)^{T} p^{(k)},
\end{aligned} \tag{2.33}$$

for $0 < c_1 < c_2 < 1$, then the iterations converge to a stationary point, $\lim_{k \to \infty} \nabla E\bigl(\bar{W}^{(k)}\bigr) = 0$, from an arbitrary initial guess (i.e., we have global convergence to a stationary point). Note that there always exist intervals of step lengths which satisfy the Wolfe conditions. This justifies the use of inexact line search methods, which require fewer iterations to find an appropriate step length that provides a sufficient reduction of the error function.
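To make the conditions (2.33) concrete, here is a short sketch that checks them for a trial step length and performs a simple backtracking search. The callables `E` and `grad_E` and the default constants `c1`, `c2` are assumptions for the example; a practical inexact line search would bracket and interpolate rather than merely halve the step.

```python
import numpy as np

def wolfe_conditions_hold(E, grad_E, w, p, alpha, c1=1e-4, c2=0.9):
    """Check the two Wolfe conditions (2.33) for a trial step length alpha.

    w and p are NumPy arrays (current weights and search direction);
    c1 and c2 are customary defaults with 0 < c1 < c2 < 1.
    """
    slope = np.dot(grad_E(w), p)   # directional derivative (negative for a descent direction)
    sufficient_decrease = E(w + alpha * p) <= E(w) + c1 * alpha * slope
    curvature = np.dot(grad_E(w + alpha * p), p) >= c2 * slope
    return sufficient_decrease and curvature

def inexact_line_search(E, grad_E, w, p, alpha0=1.0, max_trials=30):
    """Naive inexact line search: halve the trial step until (2.33) holds."""
    alpha = alpha0
    for _ in range(max_trials):
        if wolfe_conditions_hold(E, grad_E, w, p, alpha):
            return alpha
        alpha *= 0.5
    return alpha
```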
Unfortunately, the GD method has a linear convergence rate, which is very slow.

Another important first-order method is the nonlinear conjugate gradient (CG) method. In fact, it is a family of methods which utilize search directions of the following general form:

$$\begin{aligned}
p^{(0)} &= -\nabla E\bigl(\bar{W}^{(0)}\bigr), \\
p^{(k)} &= -\nabla E\bigl(\bar{W}^{(k)}\bigr) + \beta^{(k)} p^{(k-1)}.
\end{aligned} \tag{2.34}$$

Depending on the choice of the scalar $\beta^{(k)}$, we obtain several variations of the method. The most popular expressions for $\beta^{(k)}$ are the following (see also the sketch after this list):

• the Fletcher–Reeves method:
$$\beta^{(k)} = \frac{\nabla E\bigl(\bar{W}^{(k)}\bigr)^{T} \nabla E\bigl(\bar{W}^{(k)}\bigr)}{\nabla E\bigl(\bar{W}^{(k-1)}\bigr)^{T} \nabla E\bigl(\bar{W}^{(k-1)}\bigr)}; \tag{2.35}$$

• the Polak–Ribière method:
$$\beta^{(k)} = \frac{\nabla E\bigl(\bar{W}^{(k)}\bigr)^{T} \bigl(\nabla E\bigl(\bar{W}^{(k)}\bigr) - \nabla E\bigl(\bar{W}^{(k-1)}\bigr)\bigr)}{\nabla E\bigl(\bar{W}^{(k-1)}\bigr)^{T} \nabla E\bigl(\bar{W}^{(k-1)}\bigr)}; \tag{2.36}$$

• the Hestenes–Stiefel method:
$$\beta^{(k)} = \frac{\nabla E\bigl(\bar{W}^{(k)}\bigr)^{T} \bigl(\nabla E\bigl(\bar{W}^{(k)}\bigr) - \nabla E\bigl(\bar{W}^{(k-1)}\bigr)\bigr)}{\bigl(\nabla E\bigl(\bar{W}^{(k)}\bigr) - \nabla E\bigl(\bar{W}^{(k-1)}\bigr)\bigr)^{T} p^{(k-1)}}. \tag{2.37}$$
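For reference, a minimal sketch computing the three choices of $\beta^{(k)}$ and the search direction (2.34) with NumPy. The function names `cg_beta` and `cg_direction` and the variant strings are illustrative, not part of the text.

```python
import numpy as np

def cg_beta(g_new, g_old, p_old, variant="fletcher-reeves"):
    """Compute the scalar beta^(k) according to (2.35)-(2.37).

    g_new, g_old are the gradients at the current and previous iterates;
    p_old is the previous search direction (used by Hestenes-Stiefel).
    """
    if variant == "fletcher-reeves":    # (2.35)
        return np.dot(g_new, g_new) / np.dot(g_old, g_old)
    if variant == "polak-ribiere":      # (2.36)
        return np.dot(g_new, g_new - g_old) / np.dot(g_old, g_old)
    if variant == "hestenes-stiefel":   # (2.37)
        return np.dot(g_new, g_new - g_old) / np.dot(g_new - g_old, p_old)
    raise ValueError(f"unknown variant: {variant}")

def cg_direction(g_new, g_old=None, p_old=None, variant="fletcher-reeves"):
    """Search direction (2.34): negative gradient at k = 0, CG update afterwards."""
    if g_old is None:                   # k = 0: p^(0) = -grad E
        return -g_new
    return -g_new + cg_beta(g_new, g_old, p_old, variant) * p_old
```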
Irrespective of the particular $\beta^{(k)}$ selected, the first search direction $p^{(0)}$ is simply the negative gradient direction. If we assume that the error function is convex and quadratic (2.31), then the method generates a sequence of conjugate search directions (i.e., $p^{(i)\,T} A\, p^{(j)} = 0$ for $i \neq j$). If we also assume that the line searches are exact, then the method converges within $n_w$ iterations. In the general case of a nonlinear error function, the convergence rate is linear; however, a twice differentiable error function with a nonsingular Hessian is approximately quadratic in the neighborhood of the solution, which results in fast convergence. Note also that the search directions lose conjugacy, hence we need to perform