$$
\begin{aligned}
p^{(k,0)} &= 0, \\
r^{(k,0)} &= \nabla \bar{E}(W^{(k)}), \\
d^{(k,0)} &= -\nabla \bar{E}(W^{(k)}), \\
\alpha^{(k,s)} &= \frac{r^{(k,s)T} r^{(k,s)}}{d^{(k,s)T}\, \nabla^2 \bar{E}(W^{(k)})\, d^{(k,s)}}, \\
p^{(k,s+1)} &= p^{(k,s)} + \alpha^{(k,s)} d^{(k,s)}, \\
r^{(k,s+1)} &= r^{(k,s)} + \alpha^{(k,s)}\, \nabla^2 \bar{E}(W^{(k)})\, d^{(k,s)}, \\
\beta^{(k,s+1)} &= \frac{r^{(k,s+1)T} r^{(k,s+1)}}{r^{(k,s)T} r^{(k,s)}}, \\
d^{(k,s+1)} &= -r^{(k,s+1)} + \beta^{(k,s+1)} d^{(k,s)}.
\end{aligned}
$$
The iterations are terminated prematurely either if they cross the trust region boundary, $\|p^{(k,s+1)}\| \geq \Delta^{(k)}$, or if a nonpositive curvature direction is discovered, $d^{(k,s)T}\, \nabla^2 \bar{E}(W^{(k)})\, d^{(k,s)} \leq 0$. In these cases, a solution corresponds to the intersection of the current search direction with the trust region boundary. It is important to note that this method does not require one to compute the entire Hessian matrix; instead, we need only the Hessian-vector products of the form $\nabla^2 \bar{E}(W^{(k)})\, d^{(k,s)}$, which may be computed more efficiently by the reverse-mode automatic differentiation methods described below. Such Hessian-free methods have been successfully applied to neural network training [59,60].
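The following is a minimal sketch of such a Hessian-free, truncated conjugate gradient solver for the trust-region subproblem, written with JAX; the loss function `error`, the radius argument `delta`, and the iteration limits are illustrative assumptions rather than the authors' implementation. The Hessian-vector product is obtained by composing automatic differentiation transforms (forward mode over the reverse-mode gradient), so the full Hessian is never formed.

```python
import jax
import jax.numpy as jnp


def hvp(error, w, d):
    # Hessian-vector product H·d = ∇²E(w)·d via automatic differentiation;
    # the full Hessian matrix is never built.
    return jax.jvp(jax.grad(error), (w,), (d,))[1]


def to_boundary(p, d, delta):
    # Positive root tau of ||p + tau*d|| = delta, i.e. the intersection of the
    # current search direction with the trust-region boundary.
    a = d @ d
    b = 2.0 * (p @ d)
    c = p @ p - delta ** 2
    tau = (-b + jnp.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p + tau * d


def truncated_cg(error, w, delta, max_iter=250, tol=1e-8):
    # Approximately minimize the local quadratic model of `error` around `w`
    # subject to ||p|| <= delta, using conjugate gradient iterations.
    g = jax.grad(error)(w)
    if jnp.linalg.norm(g) < tol:
        return jnp.zeros_like(w)
    p = jnp.zeros_like(w)
    r, d = g, -g
    for _ in range(max_iter):
        Hd = hvp(error, w, d)
        curvature = d @ Hd
        if curvature <= 0.0:
            # Nonpositive curvature direction: stop on the boundary.
            return to_boundary(p, d, delta)
        alpha = (r @ r) / curvature
        p_new = p + alpha * d
        if jnp.linalg.norm(p_new) >= delta:
            # The step crosses the trust-region boundary: stop there.
            return to_boundary(p, d, delta)
        r_new = r + alpha * Hd
        if jnp.linalg.norm(r_new) < tol:
            return p_new
        beta = (r_new @ r_new) / (r @ r)
        p, r, d = p_new, r_new, -r_new + beta * d
    return p
```

In an outer trust-region loop, the step $p$ returned by `truncated_cg` would be accepted or rejected, and the radius adjusted, according to how well the quadratic model predicts the actual decrease of $\bar{E}$.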
Another approach to solving (2.47) [61,62] replaces the subproblem with an equivalent problem of finding both the vector $p \in \mathbb{R}^{n_w}$ and the scalar $\mu \geq 0$ such that

$$
\begin{aligned}
\bigl(\nabla^2 \bar{E}(W^{(k)}) + \mu I\bigr)\, p &= -\nabla \bar{E}(W^{(k)}), \\
\mu \bigl(\Delta^{(k)} - \|p\|\bigr) &= 0,
\end{aligned}
\qquad (2.49)
$$

where $\nabla^2 \bar{E}(W^{(k)}) + \mu I$ is positive semidefinite. There are two possibilities. If $\mu = 0$, then we have $p = -\bigl(\nabla^2 \bar{E}(W^{(k)})\bigr)^{-1} \nabla \bar{E}(W^{(k)})$ and $\|p\| \leq \Delta^{(k)}$. If $\mu > 0$, then we define $p(\mu) = -\bigl(\nabla^2 \bar{E}(W^{(k)}) + \mu I\bigr)^{-1} \nabla \bar{E}(W^{(k)})$ and solve the one-dimensional equation $\|p(\mu)\| = \Delta^{(k)}$ with respect to $\mu$.
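As an illustration of the second case, the sketch below solves $\|p(\mu)\| = \Delta^{(k)}$ by bisection on $\mu$ for a small, dense problem; the explicit Hessian `H`, the gradient `g`, and the assumption that the shifted Hessian is positive definite over the whole bracket are simplifications made for the example only. Large-scale implementations typically factor $\nabla^2 \bar{E}(W^{(k)}) + \mu I$ and apply a safeguarded Newton iteration to the radius equation instead.

```python
import jax.numpy as jnp


def p_of_mu(H, g, mu):
    # p(mu) = -(H + mu*I)^{-1} g, computed by a dense solve (illustration only).
    return -jnp.linalg.solve(H + mu * jnp.eye(H.shape[0]), g)


def solve_radius_equation(H, g, delta, iters=60):
    # Find mu >= 0 with ||p(mu)|| = delta by bisection, relying on the fact
    # that ||p(mu)|| decreases monotonically in mu while H + mu*I remains
    # positive definite (assumed here for the whole bracket).
    lo, hi = 0.0, 1.0
    while jnp.linalg.norm(p_of_mu(H, g, hi)) > delta:
        hi *= 2.0                      # enlarge the bracket until the step fits
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if jnp.linalg.norm(p_of_mu(H, g, mid)) > delta:
            lo = mid                   # step still longer than delta: raise mu
        else:
            hi = mid
    return hi, p_of_mu(H, g, hi)
```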
                                                                       the step lengths α  . Also, in order to achieve
                          There are two possibilities. If μ = 0, then we
                                            (k)    −1  (k)             a “smoother” convergence we could perform
                          have p =− ∇ E(W     )  ∇E(W    ) and p       the weight updates based on random subsets of
                                       2 ¯
                                                    ¯

                           .  If  μ> 0,    then  we  define   p(μ) =    training examples, which is called a “minibatch”
                                            −1
                                   (k)
                          − ∇ E(W ) + μI     ∇E(W  (k) ) and solve a one-  strategy. The stochastic or minibatch approach
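A minimal sketch of one SGD pass through the data in the per-example form (2.52) follows; the per-example error function, the learning rate `alpha`, and the data arrays `xs`, `ys` are assumed names used only for illustration.

```python
import jax
import jax.numpy as jnp


def sgd_epoch(key, w, per_example_error, xs, ys, alpha):
    # One pass of (2.52): shuffle the P examples, then update the weights with
    # the gradient of each individual error E^(p) in turn.
    grad_p = jax.grad(per_example_error)          # gradient w.r.t. the weights
    order = jax.random.permutation(key, xs.shape[0])
    for p in order:
        w = w - alpha * grad_p(w, xs[p], ys[p])
    return w
```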
In contrast, the usual gradient descent is called the batch method. We need to mention that although the (k,p)th step decreases the error for the pth training example, it may increase the error for the other examples. On the one hand, this allows the method to escape some local minima; on the other hand, it makes it difficult to converge to a final solution. In order to circumvent this issue, we might gradually decrease the step lengths $\alpha^{(k)}$. Also, in order to achieve a “smoother” convergence, we could perform the weight updates based on random subsets of training examples, which is called a “minibatch” strategy. The stochastic or minibatch approach may also be applied to other optimization methods; see [63].
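For completeness, a minibatch variant of the same loop is sketched below, again with assumed names only; each update uses the gradient of the error averaged over a random subset of examples, which gives the smoother behavior mentioned above.

```python
import jax
import jax.numpy as jnp


def minibatch_epoch(key, w, per_example_error, xs, ys, alpha, batch_size=32):
    # Shuffle once, then update on consecutive random subsets ("minibatches"),
    # using the gradient of the error averaged over each subset.
    batch_error = lambda w_, xb, yb: jnp.mean(
        jax.vmap(per_example_error, in_axes=(None, 0, 0))(w_, xb, yb)
    )
    batch_grad = jax.grad(batch_error)
    order = jax.random.permutation(key, xs.shape[0])
    for start in range(0, xs.shape[0], batch_size):
        idx = order[start:start + batch_size]
        w = w - alpha * batch_grad(w, xs[idx], ys[idx])
    return w
```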