
so-called “restarts,” i.e., to assign β^{(k)} ← 0. For example, we might reset β^{(k)} if the consecutive directions are nonorthogonal, i.e., if

    \frac{\left| p^{(k)\,T} p^{(k-1)} \right|}{\left\| p^{(k)} \right\|^2} > \varepsilon.

In the case of the Polak–Ribière method, we should also reset β^{(k)} if it becomes negative.
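The following minimal sketch illustrates both restart rules inside a Polak–Ribière conjugate gradient loop; the `grad` callable, the fixed step length standing in for a line search, and the threshold value are assumptions for illustration, not the book's prescriptions.

```python
import numpy as np

def polak_ribiere_cg(grad, w, eps=0.1, alpha=1e-2, tol=1e-6, max_iter=1000):
    """Conjugate gradient with Polak-Ribiere beta and restarts.

    grad  -- user-supplied function returning the error gradient at w
    eps   -- nonorthogonality threshold that triggers a restart
    alpha -- fixed step length (a proper line search would be used in practice)
    """
    g = grad(w)
    p = -g                                    # initial steepest-descent direction
    for k in range(max_iter):
        w = w + alpha * p
        g_new = grad(w)
        if np.linalg.norm(g_new) < tol:
            break
        beta = g_new @ (g_new - g) / (g @ g)  # Polak-Ribiere coefficient
        if beta < 0.0:                        # PR-specific restart rule
            beta = 0.0
        p_new = -g_new + beta * p
        # restart if consecutive directions are far from orthogonal
        if abs(p_new @ p) / (p_new @ p_new) > eps:
            p_new = -g_new                    # equivalent to assigning beta = 0
        g, p = g_new, p_new
    return w
```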
  The basic second-order method is Newton’s method:

    p^{(k)} = -\left[ \nabla^2 \bar{E}(W^{(k)}) \right]^{-1} \nabla \bar{E}(W^{(k)}).        (2.38)

If the Hessian ∇²Ē(W^{(k)}) is positive definite, the resulting search direction p^{(k)} is a descent direction. If the error function is convex and quadratic, Newton’s method with a unit step length α^{(k)} = 1 finds the solution in a single step. For a smooth nonlinear error function with a positive definite Hessian at the solution, the convergence is quadratic, provided the initial guess lies sufficiently close to the solution. If a Hessian turns out to have negative or zero eigenvalues, we need to modify it in order to obtain a positive definite approximation B; for example, we might add a scaled identity matrix, so we have

    B^{(k)} = \nabla^2 \bar{E}(W^{(k)}) + \mu^{(k)} I.        (2.39)

The resulting damped method may be viewed as a hybrid of the ordinary Newton method (for μ^{(k)} = 0) and gradient descent (for μ^{(k)} → ∞).
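A one-step sketch of this damped Newton iteration, assuming hypothetical `grad` and `hess` callables for the gradient and Hessian of the error function:

```python
import numpy as np

def damped_newton_step(grad, hess, w, mu=0.0):
    """One damped Newton step: solve (H + mu*I) p = -g, Eqs. (2.38)-(2.39).

    mu = 0 recovers the pure Newton direction; as mu grows, the step
    turns toward (scaled) gradient descent.
    """
    g = grad(w)
    B = hess(w) + mu * np.eye(len(w))   # regularized Hessian, Eq. (2.39)
    p = np.linalg.solve(B, -g)          # solve rather than form the inverse
    return w + p
```

Solving the linear system is preferred over forming the explicit inverse in (2.38), since it is cheaper and numerically more stable.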
  Note that the Hessian computation is computationally very expensive; hence various approximations have been proposed. If we assume that each individual error is a quadratic form,

    E^{(p)}(W) = \frac{1}{2} e^{(p)}(W)^T e^{(p)}(W),        (2.40)

then the gradient and Hessian may be expressed in terms of the error Jacobian as follows:

    \nabla E^{(p)}(W) = \frac{\partial e^{(p)}(W)}{\partial W}^T e^{(p)}(W),
    \nabla^2 E^{(p)}(W) = \frac{\partial e^{(p)}(W)}{\partial W}^T \frac{\partial e^{(p)}(W)}{\partial W} + \sum_{i=1}^{n_e} \frac{\partial^2 e_i^{(p)}(W)}{\partial W^2}\, e_i^{(p)}(W).        (2.41)

Then, the Gauss–Newton approximation to the Hessian is obtained by discarding the second-order terms, i.e.,

    \nabla^2 E^{(p)}(W) \approx B^{(p)} = \frac{\partial e^{(p)}(W)}{\partial W}^T \frac{\partial e^{(p)}(W)}{\partial W}.        (2.42)
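Assuming user-supplied `residual` and `jacobian` callables (illustrative names), the gradient and the Gauss–Newton matrix of (2.41)–(2.42) are assembled from first derivatives only:

```python
import numpy as np

def gauss_newton_terms(residual, jacobian, w):
    """Gradient and Gauss-Newton Hessian approximation for
    E(W) = 0.5 * e(W)^T e(W).

    residual -- returns the error vector e(W), shape (n_e,)
    jacobian -- returns J = de/dW, shape (n_e, n_w)
    """
    e = residual(w)
    J = jacobian(w)
    grad = J.T @ e    # exact gradient, first line of Eq. (2.41)
    B = J.T @ J       # Gauss-Newton approximation, Eq. (2.42)
    return grad, B
```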
The resulting matrix B can turn out to be degenerate, so we might modify it by adding a scaled identity matrix, as mentioned above in (2.39). Then we have

    B^{(p)} = \frac{\partial e^{(p)}(W)}{\partial W}^T \frac{\partial e^{(p)}(W)}{\partial W} + \mu^{(k)} I.        (2.43)

This technique leads us to the Levenberg–Marquardt method.
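A compact Levenberg–Marquardt loop built on (2.43); the multiplicative adaptation of μ below is one common heuristic, assumed here for illustration rather than taken from the book:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, w, mu=1e-2, tol=1e-6, max_iter=100):
    """Levenberg-Marquardt: solve (J^T J + mu*I) p = -J^T e at each step."""
    e = residual(w)
    for k in range(max_iter):
        J = jacobian(w)
        g = J.T @ e
        if np.linalg.norm(g) < tol:
            break
        B = J.T @ J + mu * np.eye(len(w))   # damped matrix, Eq. (2.43)
        p = np.linalg.solve(B, -g)
        e_new = residual(w + p)
        if e_new @ e_new < e @ e:           # error decreased: accept, relax damping
            w, e = w + p, e_new
            mu *= 0.5
        else:                               # error increased: reject, raise damping
            mu *= 2.0
    return w
```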
  A family of quasi-Newton methods estimates the inverse Hessian by accumulating the changes of gradients. These methods construct an inverse Hessian approximation H ≈ [∇²Ē(W)]^{-1} so as to satisfy the secant equation:

    H^{(k+1)} y^{(k)} = s^{(k)},
    s^{(k)} = W^{(k+1)} - W^{(k)},        (2.44)
    y^{(k)} = \nabla \bar{E}(W^{(k+1)}) - \nabla \bar{E}(W^{(k)}).

However, for n_w > 1 this system of equations is underdetermined and there exists an infinite number of solutions. Thus, additional constraints are imposed, giving rise to various quasi-Newton methods; one representative update is sketched below.
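As one concrete example, the BFGS update, a standard member of this family (chosen here as an illustration, not as the book's prescription), produces a new approximation that satisfies the secant equation (2.44) exactly:

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse Hessian approximation H.

    s = W_{k+1} - W_k and y = grad_{k+1} - grad_k as in Eq. (2.44).
    The returned matrix satisfies H_new @ y == s (the secant equation).
    """
    rho = 1.0 / (y @ s)                 # assumes the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```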
Most of them require that the inverse Hessian approximation H^{(k+1)}