Backpropagation through time algorithm for error gradient and Hessian. Second-order sensitivities of the error function are computed during a backward-in-time pass as follows:

\[
\begin{aligned}
\frac{\partial \lambda(t_{K+1})}{\partial W} &= 0,\\
\frac{\partial \lambda(t_k)}{\partial W} &= \frac{\partial^2 e(\tilde y(t_k), z(t_k), W)}{\partial z\,\partial W}
 + \frac{\partial^2 e(\tilde y(t_k), z(t_k), W)}{\partial z^2}\,\frac{\partial z(t_k)}{\partial W}\\
&\quad + \sum_{i=1}^{n_z} \lambda_i(t_{k+1})\left[\frac{\partial^2 F_i(z(t_k), u(t_k), W)}{\partial z\,\partial W}
 + \frac{\partial^2 F_i(z(t_k), u(t_k), W)}{\partial z^2}\,\frac{\partial z(t_k)}{\partial W}\right]\\
&\quad + \left(\frac{\partial F(z(t_k), u(t_k), W)}{\partial z}\right)^{T}\frac{\partial \lambda(t_{k+1})}{\partial W},
 \qquad k = K, \ldots, 1.
\end{aligned}
\tag{2.94}
\]
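Recursion (2.94) is what one obtains by differentiating, with respect to W, the backward-in-time adjoint recursion for λ(t_k) that yields the error gradient. The following sketch illustrates that underlying first-order pass on a small hypothetical model and checks it against automatic differentiation; the forms of F, G, the unit-weight quadratic error e, and all dimensions are assumptions made purely for illustration, not the book's working example.

```python
import jax
import jax.numpy as jnp

def F(z, u, W):                        # state-transition map (assumed form)
    return jnp.tanh(W["A"] @ z + W["B"] @ u)

def G(z, W):                           # observation map (assumed form)
    return W["C"] @ z

def e(y_tilde, z, W):                  # instantaneous error, unit weights
    r = y_tilde - G(z, W)
    return 0.5 * jnp.dot(r, r)

def unroll(W, z0, U):                  # forward pass: z(t_1), ..., z(t_K)
    zs, z = [], z0
    for u in U:
        z = F(z, u, W)
        zs.append(z)
    return zs

def E(W, z0, U, Y):                    # trajectory error over the horizon
    return sum(e(y, z, W) for y, z in zip(Y, unroll(W, z0, U)))

def grad_bptt(W, z0, U, Y):
    """Error gradient via the backward-in-time adjoint recursion."""
    zs = unroll(W, z0, U)
    prev = [z0] + zs[:-1]              # z(t_0), ..., z(t_{K-1})
    K = len(U)
    lam_next = jnp.zeros_like(z0)      # lambda(t_{K+1}) = 0
    g = jax.tree_util.tree_map(jnp.zeros_like, W)
    for k in reversed(range(K)):       # backward pass, t_K down to t_1
        # lambda(t_k) = de/dz + (dF(z(t_k), u(t_k), W)/dz)^T lambda(t_{k+1})
        lam = jax.grad(e, argnums=1)(Y[k], zs[k], W)
        if k + 1 < K:                  # the Jacobian term vanishes at k = K
            lam += jax.jacobian(F, argnums=0)(zs[k], U[k + 1], W).T @ lam_next
        # accumulate de/dW + (dF(z(t_{k-1}), u(t_{k-1}), W)/dW)^T lambda(t_k)
        g_e = jax.grad(e, argnums=2)(Y[k], zs[k], W)
        _, vjp_F = jax.vjp(lambda W_: F(prev[k], U[k], W_), W)
        (g_F,) = vjp_F(lam)
        g = jax.tree_util.tree_map(lambda a, b, c: a + b + c, g, g_e, g_F)
        lam_next = lam
    return g

# quick check against automatic differentiation of the unrolled error
nz, nu, ny, K = 3, 2, 2, 10
W = {"A": 0.1 * jnp.eye(nz), "B": 0.2 * jnp.ones((nz, nu)),
     "C": jnp.ones((ny, nz))}
z0, U, Y = jnp.zeros(nz), jnp.ones((K, nu)), 0.5 * jnp.ones((K, ny))
print(jnp.allclose(grad_bptt(W, z0, U, Y)["A"],
                   jax.grad(E)(W, z0, U, Y)["A"], atol=1e-5))
```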
The Hessian of the individual trajectory error function (2.84) equals

\[
\begin{aligned}
\frac{\partial^2 E(W)}{\partial W^2} &= \sum_{k=1}^{K}\Biggl\{\frac{\partial^2 e(\tilde y(t_k), z(t_k), W)}{\partial W^2}
 + \frac{\partial^2 e(\tilde y(t_k), z(t_k), W)}{\partial W\,\partial z}\,\frac{\partial z(t_k)}{\partial W}\\
&\quad + \sum_{i=1}^{n_z}\lambda_i(t_k)\left[\frac{\partial^2 F_i(z(t_{k-1}), u(t_{k-1}), W)}{\partial W^2}
 + \frac{\partial^2 F_i(z(t_{k-1}), u(t_{k-1}), W)}{\partial W\,\partial z}\,\frac{\partial z(t_{k-1})}{\partial W}\right]\\
&\quad + \left(\frac{\partial F(z(t_{k-1}), u(t_{k-1}), W)}{\partial W}\right)^{T}\frac{\partial \lambda(t_k)}{\partial W}\Biggr\}.
\end{aligned}
\tag{2.95}
\]
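For moderate horizons, the Hessian (2.95) can also be cross-checked by differentiating the unrolled trajectory error twice with automatic differentiation. The short continuation below reuses E, W, z0, U, and Y from the previous sketch and flattens the parameters into a single vector; it is only a numerical reference for the recursions above, not part of the algorithm itself.

```python
# continues the previous sketch: E, W, z0, U, Y are reused
from jax.flatten_util import ravel_pytree

w_flat, unravel = ravel_pytree(W)                # all parameters as one vector
E_flat = lambda w: E(unravel(w), z0, U, Y)       # trajectory error of that vector

H = jax.hessian(E_flat)(w_flat)                  # reference value for (2.95)
print(H.shape, jnp.allclose(H, H.T, atol=1e-5))  # (n_W, n_W), symmetric
```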
Second-order derivatives of the instantaneous error function (2.85) have the form

\[
\begin{aligned}
\frac{\partial^2 e(\tilde y, z, W)}{\partial W^2} &= \left(\frac{\partial G(z, W)}{\partial W}\right)^{T}\frac{\partial G(z, W)}{\partial W}
 - \sum_{i=1}^{n_y}\frac{\partial^2 G_i(z, W)}{\partial W^2}\,\omega_i\bigl(\tilde y_i - G_i(z, W)\bigr),\\
\frac{\partial^2 e(\tilde y, z, W)}{\partial W\,\partial z} &= \left(\frac{\partial G(z, W)}{\partial W}\right)^{T}\frac{\partial G(z, W)}{\partial z}
 - \sum_{i=1}^{n_y}\frac{\partial^2 G_i(z, W)}{\partial W\,\partial z}\,\omega_i\bigl(\tilde y_i - G_i(z, W)\bigr),\\
\frac{\partial^2 e(\tilde y, z, W)}{\partial z^2} &= \left(\frac{\partial G(z, W)}{\partial z}\right)^{T}\frac{\partial G(z, W)}{\partial z}
 - \sum_{i=1}^{n_y}\frac{\partial^2 G_i(z, W)}{\partial z^2}\,\omega_i\bigl(\tilde y_i - G_i(z, W)\bigr).
\end{aligned}
\tag{2.96}
\]
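Each line of (2.96) splits a second derivative of the instantaneous error into a product of first derivatives of G and a residual-weighted term containing second derivatives of G. The sketch below verifies the first of these identities against automatic differentiation for a toy observation map G with unit weights ω_i = 1; the map and the numbers are assumptions chosen only for illustration.

```python
import jax
import jax.numpy as jnp

def G(z, W):                                # toy observation map (assumed form)
    return jnp.tanh(W @ z)

def e(y_tilde, z, W):                       # instantaneous error, omega_i = 1
    r = y_tilde - G(z, W)
    return 0.5 * jnp.dot(r, r)

W = jnp.array([[0.5, -0.2], [0.1, 0.3]])    # n_y = 2 outputs, n_z = 2 states
z = jnp.array([0.4, -1.0])
y_tilde = jnp.array([0.2, 0.1])
n_y, n_W = 2, W.size                        # n_W = 4 flattened weights

# first term of (2.96): (dG/dW)^T (dG/dW)
JW = jax.jacobian(G, argnums=1)(z, W).reshape(n_y, n_W)
first = JW.T @ JW

# second term: - sum_i (d^2 G_i/dW^2) * (y_tilde_i - G_i)
res = y_tilde - G(z, W)
HG = jax.hessian(G, argnums=1)(z, W).reshape(n_y, n_W, n_W)
second = -jnp.einsum("ijk,i->jk", HG, res)

# compare with the Hessian of e computed by automatic differentiation
H_ref = jax.hessian(e, argnums=2)(y_tilde, z, W).reshape(n_W, n_W)
print(jnp.allclose(first + second, H_ref, atol=1e-5))
```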
In the rest of this subsection, we discuss various difficulties associated with the recurrent neural network training problem. First, notice that a recurrent neural network which performs a K-step-ahead prediction may be “unfolded” in time to produce an equivalent layered feedforward neural network, comprised of K copies of the same subnetwork, one per time step. Each of these identical subnetworks shares a common set of parameters.
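A minimal sketch of this unfolding, with an assumed form of the subnetwork F and illustrative dimensions: every one of the K “layers” applies the same F with the same parameter set W.

```python
import jax
import jax.numpy as jnp

def F(z, u, W):                         # the shared subnetwork (assumed form)
    return jnp.tanh(W["A"] @ z + W["B"] @ u)

def unfolded(W, z0, U):
    # one scan step = one "layer"; every layer applies the same F with the
    # same parameter set W, i.e. the weight sharing described above
    def layer(z, u):
        z_next = F(z, u, W)
        return z_next, z_next
    _, states = jax.lax.scan(layer, z0, U)
    return states                       # z(t_1), ..., z(t_K)

nz, nu, K = 3, 2, 100                   # a long horizon gives a very deep network
W = {"A": 0.9 * jnp.eye(nz), "B": 0.1 * jnp.ones((nz, nu))}
print(unfolded(W, jnp.zeros(nz), jnp.ones((K, nu))).shape)   # (100, 3)
```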
Given a large prediction horizon, the resulting feedforward network becomes very deep. Thus, it is natural that all the difficulties associated with deep neural network training are also inherent to recurrent neural network training. In fact, these problems become even more severe. They include the following:

1. Vanishing and exploding gradients [71–74]. Note that the sensitivity of a recurrent neural network (2.13) state at time step t_k with respect to its state at time step t_l (l ≤ k) has the following form:

\[
\frac{\partial z(t_k)}{\partial z(t_l)} = \prod_{r=l}^{k-1}\frac{\partial F(z(t_r), u(t_r), W)}{\partial z}.
\tag{2.97}
\]
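The sketch below evaluates the product (2.97) for an assumed linear state map, chosen so that every factor equals W; the norm of the resulting sensitivity matrix either decays or grows geometrically with the gap k − l, which is precisely the vanishing/exploding gradient effect. The model and the numbers are illustrative assumptions only.

```python
import jax
import jax.numpy as jnp

def F(z, u, W):                          # linear state map: Jacobian dF/dz = W
    return W @ z + u

def sensitivity(W, z0, U, l, k):
    """dz(t_k)/dz(t_l) as the product of Jacobians over r = l, ..., k-1."""
    zs = [z0]                            # forward pass: z(t_0), ..., z(t_{k-1})
    for u in U[:k]:
        zs.append(F(zs[-1], u, W))
    S = jnp.eye(z0.size)
    for r in range(l, k):                # accumulate the product in (2.97)
        S = jax.jacobian(F, argnums=0)(zs[r], U[r], W) @ S
    return S

nz, K = 4, 60
z0, U = jnp.zeros(nz), 0.1 * jnp.ones((K, nz))
for scale in (0.5, 1.5):                 # contractive vs. expansive dynamics
    S = sensitivity(scale * jnp.eye(nz), z0, U, l=0, k=K)
    print(scale, float(jnp.linalg.norm(S)))   # vanishing vs. exploding norm
```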