be symmetric and positive definite, and also minimize the distance to the previous estimate $H^{(k)}$ with respect to some norm: $H^{(k+1)} = \arg\min_{H} \|H - H^{(k)}\|$. One of the most popular variations of quasi-Newton methods is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm:

$$H^{(k+1)} = \left(I - \rho^{(k)} s^{(k)} \bigl(y^{(k)}\bigr)^{\mathrm{T}}\right) H^{(k)} \left(I - \rho^{(k)} y^{(k)} \bigl(s^{(k)}\bigr)^{\mathrm{T}}\right) + \rho^{(k)} s^{(k)} \bigl(s^{(k)}\bigr)^{\mathrm{T}}, \tag{2.45}$$

$$\rho^{(k)} = \frac{1}{\bigl(y^{(k)}\bigr)^{\mathrm{T}} s^{(k)}}.$$
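To make the update concrete, here is a minimal NumPy sketch of a single application of (2.45); the function name `bfgs_update` is illustrative, and the arguments follow the usual quasi-Newton definitions $s^{(k)} = W^{(k+1)} - W^{(k)}$ and $y^{(k)} = \nabla\bar{E}(W^{(k+1)}) - \nabla\bar{E}(W^{(k)})$, which are assumptions here rather than notation fixed by this page.

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse Hessian approximation, Eq. (2.45).

    H -- current approximation H^(k), shape (n_w, n_w)
    s -- parameter step s^(k) = W^(k+1) - W^(k)
    y -- gradient change y^(k) = grad E(W^(k+1)) - grad E(W^(k))
    """
    rho = 1.0 / (y @ s)                       # rho^(k) = 1 / (y^(k)^T s^(k))
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)              # I - rho^(k) s^(k) (y^(k))^T
    return V @ H @ V.T + rho * np.outer(s, s)

# The next search direction would then be p = -H_new @ grad, followed by a
# line search on the step length.
```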
The initial guess for the inverse Hessian may be selected in different ways: for example, if $H^{(0)} = I$, then the first search direction corresponds to that of GD. Note that if $H^{(0)}$ is positive definite and the step length is selected by a line search so as to satisfy the Wolfe conditions (2.33), then all the resulting $H^{(k)}$ will also be positive definite. The BFGS algorithm has a superlinear rate of convergence. We should also mention that the inverse Hessian approximation contains $n_w^2$ elements, and when the number of parameters $n_w$ is large, it might not fit in memory. In order to circumvent this issue, a limited memory BFGS (L-BFGS) [56] version of the algorithm has been proposed, which stores only the $m \ll n_w$ most recent vector pairs $\bigl\{s^{(j)}, y^{(j)}\bigr\}_{j=k-m}^{k-1}$ and uses them to compute the search direction without explicit evaluation of $H^{(k)}$.
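As an illustration of how the stored pairs replace an explicit $H^{(k)}$, the following sketch implements the standard two-loop recursion commonly used by L-BFGS to obtain the search direction. The initial scaling $\gamma = (s^{\mathrm{T}}y)/(y^{\mathrm{T}}y)$ is a common heuristic choice rather than something prescribed by the text, and all names are illustrative.

```python
import numpy as np

def lbfgs_direction(grad, pairs):
    """Search direction p = -H^(k) grad, computed implicitly from the m
    most recent pairs (s^(j), y^(j)) by the standard two-loop recursion.

    pairs -- list of (s, y) tuples, ordered from oldest to most recent
    """
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in pairs]
    alphas = []
    # First loop: most recent pair to oldest
    for (s, y), rho in zip(reversed(pairs), reversed(rhos)):
        alpha = rho * (s @ q)
        q = q - alpha * y
        alphas.append(alpha)
    # Initial Hessian scaling gamma * I (a common heuristic choice)
    s_m, y_m = pairs[-1]
    r = ((s_m @ y_m) / (y_m @ y_m)) * q
    # Second loop: oldest pair to most recent
    for (s, y), rho, alpha in zip(pairs, rhos, reversed(alphas)):
        beta = rho * (y @ r)
        r = r + (alpha - beta) * s
    return -r
```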
Another strategy, alternative to line search methods, is a family of trust region methods [57]. These methods repeatedly construct some local model $\bar{M}^{(k)}$ of the error function, which is assumed to be valid in a neighborhood of the current point $W^{(k)}$, and minimize it within this neighborhood. If we utilize a second-order Taylor series approximation of the error function as the model,

$$\bar{E}\bigl(W^{(k)} + p\bigr) \approx \bar{M}^{(k)}(p) = \bar{E}\bigl(W^{(k)}\bigr) + p^{\mathrm{T}} \nabla \bar{E}\bigl(W^{(k)}\bigr) + \frac{1}{2}\, p^{\mathrm{T}} \nabla^2 \bar{E}\bigl(W^{(k)}\bigr)\, p, \tag{2.46}$$

and a ball of radius $\Delta^{(k)}$ as a trust region, we obtain the following constrained quadratic optimization subproblems:

$$\begin{aligned} &\underset{p}{\text{minimize}} && \bar{M}^{(k)}(p) \\ &\text{subject to} && \|p\| \leqslant \Delta^{(k)}. \end{aligned} \tag{2.47}$$

The trust region radius is adapted based on the ratio of the actual to the predicted reduction of the error function, i.e.,

$$\rho^{(k)} = \frac{\bar{E}\bigl(W^{(k)}\bigr) - \bar{E}\bigl(W^{(k)} + p^{(k)}\bigr)}{\bar{M}^{(k)}(0) - \bar{M}^{(k)}\bigl(p^{(k)}\bigr)}. \tag{2.48}$$

If this ratio is close to 1, the trust region radius is increased by some factor; if the ratio is close to 0, the radius is decreased. Also, if $\rho^{(k)}$ is negative or very small, the step is rejected. We could also use ellipsoidal trust regions of the form $\|Dp\| \leqslant \Delta^{(k)}$, where $D$ is a nonsingular diagonal matrix.
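A minimal sketch of this acceptance and radius-adaptation logic is given below. The thresholds (0.25, 0.75), the growth/shrink factors, and the acceptance parameter `eta` are conventional choices, not values prescribed by the text, and `solve_subproblem` stands for any approximate solver of (2.47).

```python
import numpy as np

def trust_region_step(W, delta, E, grad, hess_vec, solve_subproblem,
                      eta=0.1, delta_max=10.0):
    """One trust-region iteration: solve (2.47) approximately, then adapt
    the radius from the ratio (2.48) and accept or reject the step.

    grad     -- gradient of E at W
    hess_vec -- function returning the Hessian-vector product at W
    """
    p = solve_subproblem(grad, hess_vec, delta)
    predicted = -(grad @ p + 0.5 * (p @ hess_vec(p)))   # M(0) - M(p)
    actual = E(W) - E(W + p)                            # actual reduction
    rho = actual / predicted                            # Eq. (2.48)
    if rho < 0.25:                                      # poor agreement: shrink
        delta = 0.25 * delta
    elif rho > 0.75 and np.isclose(np.linalg.norm(p), delta):
        delta = min(2.0 * delta, delta_max)             # good agreement: grow
    if rho > eta:                                       # accept only if E decreased
        W = W + p
    return W, delta
```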
There exist various approaches to solving the constrained quadratic optimization subproblem (2.47). One of these techniques [58] relies on the linear conjugate gradient method in order to solve the system of linear equations

$$\nabla^2 \bar{E}\bigl(W^{(k)}\bigr)\, p = -\nabla \bar{E}\bigl(W^{(k)}\bigr),$$

with respect to $p$. This results in the following iterations: