Minimization is carried out by means of various iterative numerical methods. The optimization methods can be divided into global and local ones, according to the type of minimum they seek. Global optimization methods seek an approximate global minimum, whereas local methods seek a precise local minimum. Most of the global optimization methods have a stochastic nature (e.g., simulated annealing, evolutionary algorithms, particle swarm optimization), and convergence is achieved almost surely and only in the limit. In this book we focus on local deterministic gradient-based optimization methods, which guarantee rapid convergence to a local solution under some reasonable assumptions. In order to apply these methods, we also require the error function to be sufficiently smooth (which is usually the case with neural networks, provided all the activation functions are smooth). For more detailed information on local optimization methods, we refer to [49–52]. Metaheuristic global optimization methods are covered in [53,54].
Note that the local optimization methods require an initial guess $W^{(0)}$ for parameter values. There are various approaches to the initialization of network parameters. For example, the parameters may be sampled from a Gaussian distribution, i.e.,
\[
  W_i \sim \mathcal{N}(0, 1), \quad i = 1, \dots, n_w. \tag{2.27}
\]
The following alternative initialization method for layered feedforward neural networks (2.8), called Xavier initialization, was suggested in [55]:
\[
  b_i^l = 0, \qquad
  w_{i,j}^l \sim \mathcal{U}\!\left(
    -\sqrt{\frac{6}{S_{l-1} + S_l}},\;
    \sqrt{\frac{6}{S_{l-1} + S_l}}
  \right). \tag{2.28}
\]
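To make the two schemes concrete, a minimal NumPy sketch of both initializations is given below; the function names, seed handling, and example layer sizes are illustrative choices, not part of the original text.

```python
import numpy as np

def gaussian_init(layer_sizes, seed=0):
    """Sample every weight and bias from N(0, 1), cf. Eq. (2.27)."""
    rng = np.random.default_rng(seed)
    params = []
    for S_prev, S_next in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((S_next, S_prev))   # weights ~ N(0, 1)
        b = rng.standard_normal(S_next)             # biases  ~ N(0, 1)
        params.append((W, b))
    return params

def xavier_init(layer_sizes, seed=0):
    """Xavier initialization, cf. Eq. (2.28): zero biases and
    weights ~ U(-sqrt(6/(S_{l-1}+S_l)), +sqrt(6/(S_{l-1}+S_l)))."""
    rng = np.random.default_rng(seed)
    params = []
    for S_prev, S_next in zip(layer_sizes[:-1], layer_sizes[1:]):
        limit = np.sqrt(6.0 / (S_prev + S_next))
        W = rng.uniform(-limit, limit, size=(S_next, S_prev))
        b = np.zeros(S_next)
        params.append((W, b))
    return params

# Example: layer sizes S_0 = 4, S_1 = 10, S_2 = 1.
initial_params = xavier_init([4, 10, 1])
```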
Optimization methods may also be classified by the order of error function derivatives used to guide the search process. Thus, zero-order methods use only the error function values; first-order methods rely on the first derivatives (gradient $\nabla \bar{E}$); second-order methods also utilize the second derivatives (Hessian $\nabla^2 \bar{E}$).

The basic descent method has the form
\[
  W^{(k+1)} = W^{(k)} + \alpha^{(k)} p^{(k)}, \qquad
  \bar{E}(W^{(k+1)}) < \bar{E}(W^{(k)}), \tag{2.29}
\]
where $p^{(k)}$ is a search direction and $\alpha^{(k)}$ represents a step length, also called the learning rate. Note that we require each step to decrease the error function. In order to guarantee the error function decrease for arbitrarily small step lengths, we need the search direction to be a descent direction, that is, to satisfy $p^{(k)T} \nabla \bar{E}(W^{(k)}) < 0$.
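The following Python fragment sketches a single iteration of this basic descent scheme, including the descent direction test; the function and argument names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def descent_step(E_bar, grad_E, W, p, alpha):
    """One iteration of the basic descent method (2.29).

    E_bar  : callable returning the scalar error at parameters W
    grad_E : callable returning the gradient of E_bar at W
    p      : proposed search direction
    alpha  : step length (learning rate)
    """
    g = grad_E(W)
    if p @ g >= 0:
        # p^T grad E(W) must be negative; otherwise even an
        # arbitrarily small step need not decrease the error.
        raise ValueError("p is not a descent direction")
    W_new = W + alpha * p
    # Accept the step only if the error actually decreases.
    return W_new if E_bar(W_new) < E_bar(W) else W

# Example on a one-dimensional quadratic E(w) = w^2 (gradient 2w):
E_bar  = lambda w: float(w @ w)
grad_E = lambda w: 2.0 * w
W = descent_step(E_bar, grad_E, np.array([3.0]), p=np.array([-1.0]), alpha=0.1)
```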
The simplest example of a first-order descent method is the gradient descent (GD) method, which utilizes the negative gradient search direction, i.e.,
\[
  p^{(k)} = -\nabla \bar{E}(W^{(k)}). \tag{2.30}
\]
The step lengths may be assigned beforehand ($\alpha^{(k)} \equiv \alpha$ for all $k$), but if the step $\alpha$ is too large, the error function might actually increase, and then the iterations would diverge. For example, in the case of a convex quadratic error function of the form
\[
  \bar{E}(W) = \frac{1}{2} W^T A W + b^T W + c, \tag{2.31}
\]
where $A$ is a symmetric positive definite matrix with a maximum eigenvalue of $\lambda_{\max}$, the step length must satisfy
\[
  \alpha < \frac{2}{\lambda_{\max}}
\]
in order to guarantee the convergence of gradient descent iterations. On the other hand, a small step $\alpha$ would result in slow convergence.
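Indeed, for the quadratic (2.31) the gradient descent iterations satisfy $W^{(k+1)} - W^* = (I - \alpha A)(W^{(k)} - W^*)$, where $W^* = -A^{-1}b$ is the minimizer, so convergence requires all eigenvalues of $I - \alpha A$ to lie in $(-1, 1)$, which yields the bound above. The sketch below checks this behavior numerically on an arbitrary $2 \times 2$ example; the matrix, the starting point, and the iteration count are illustrative choices.

```python
import numpy as np

# Convex quadratic E(W) = 0.5 * W^T A W + b^T W + c, cf. Eq. (2.31).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                 # symmetric positive definite
b = np.array([-1.0, 0.5])
c = 0.0

E_bar  = lambda W: 0.5 * W @ A @ W + b @ W + c
grad_E = lambda W: A @ W + b

lam_max = np.linalg.eigvalsh(A).max()      # largest eigenvalue of A
W_star  = np.linalg.solve(A, -b)           # exact minimizer for reference

def gradient_descent(alpha, n_iter=200):
    """Gradient descent with a fixed step length alpha, cf. Eq. (2.30)."""
    W = np.array([5.0, -5.0])
    for _ in range(n_iter):
        W = W - alpha * grad_E(W)
    return W

# alpha < 2 / lambda_max: the iterations converge to W_star.
print(gradient_descent(alpha=0.9 * 2 / lam_max), W_star)
# alpha > 2 / lambda_max: the iterations diverge (entries blow up).
print(gradient_descent(alpha=1.1 * 2 / lam_max))
```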
In order to circumvent this problem, we can perform a step length adaptation: we take a "trial" step, evaluate the error function, and check whether