accurate closed-loop multistep-ahead prediction of the dynamical system behavior. In this subsection, we discuss the most general state space form (2.13) of dynamic neural networks.

Assume we are given an experimental data set of the form

$$\Bigl\{ \bigl\{ u^{(p)}(t_k),\ \tilde{y}^{(p)}(t_k) \bigr\}_{k=0}^{K^{(p)}} \Bigr\}_{p=1}^{P}, \qquad (2.82)$$

where $P$ is the total number of trajectories, $K^{(p)}$ is the number of time steps for the corresponding trajectory, $t_k = k\,\Delta t$ are the discrete time instants, $u^{(p)}(t_k)$ are the control inputs, and $\tilde{y}^{(p)}(t_k)$ are the observed outputs. We will also denote the total duration of the $p$th trajectory by $\bar{t}^{(p)} = K^{(p)}\,\Delta t$.
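To make the layout of the data set (2.82) concrete, here is a minimal Python sketch of one possible container; the class name, field names, and NumPy representation are illustrative choices of ours, not part of the original formulation.

```python
import numpy as np

# One experimental trajectory from the data set (2.82):
# control inputs u(t_k) and noisy observed outputs y_tilde(t_k)
# sampled at t_k = k * dt, k = 0, ..., K.
class Trajectory:
    def __init__(self, u, y_tilde, dt):
        assert len(u) == len(y_tilde)       # K + 1 samples each
        self.u = np.asarray(u)              # shape (K + 1, n_u)
        self.y_tilde = np.asarray(y_tilde)  # shape (K + 1, n_y)
        self.dt = dt

    @property
    def K(self):
        return len(self.u) - 1  # number of time steps K^(p)

    @property
    def duration(self):
        return self.K * self.dt  # total duration t_bar^(p) = K^(p) * dt

# The full data set is simply a list of P such trajectories.
dataset = [
    Trajectory(u=np.zeros((101, 1)), y_tilde=np.zeros((101, 2)), dt=0.01)
    for _ in range(5)  # e.g., P = 5 trajectories, K = 100 steps each
]
```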
Note that in general the observed outputs $\tilde{y}^{(p)}(t_k)$ do not match the true outputs $y^{(p)}(t_k)$. We assume that the observations are corrupted by additive white Gaussian noise, i.e.,

$$\tilde{y}^{(p)}(t) = y^{(p)}(t) + \eta^{(p)}(t). \qquad (2.83)$$

That is, $\eta^{(p)}(t)$ represents a stationary Gaussian process with zero mean and a covariance function $K_\eta(t_1, t_2) = \delta(t_2 - t_1)\,\Sigma$, where

$$\Sigma = \begin{pmatrix} \sigma_1^2 & & 0 \\ & \ddots & \\ 0 & & \sigma_{n_y}^2 \end{pmatrix}.$$
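As a small illustration of the observation model (2.83), the sketch below corrupts a clean output sequence with zero-mean Gaussian noise whose covariance is the diagonal matrix above; the function name and variables are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def observe(y, sigma):
    """Corrupt the true outputs y (shape (K+1, n_y)) with additive white
    Gaussian noise, eq. (2.83); sigma holds the per-output standard
    deviations, so the covariance matrix is diag(sigma**2)."""
    eta = rng.normal(scale=sigma, size=y.shape)  # eta^(p)(t_k)
    return y + eta

y_true = np.zeros((101, 2))  # placeholder true outputs
y_tilde = observe(y_true, sigma=np.array([0.05, 0.1]))
```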
The individual errors $E^{(p)}$ for each trajectory have the following form:

$$E^{(p)}(W) = \sum_{k=1}^{K^{(p)}} e\bigl(\tilde{y}^{(p)}(t_k), z^{(p)}(t_k), W\bigr), \qquad (2.84)$$

where $z^{(p)}(t_k)$ are the model states and $e^{(p)} \colon \mathbb{R}^{n_y} \times \mathbb{R}^{n_z} \times \mathbb{R}^{n_w} \to \mathbb{R}$ represents the model prediction error at time instant $t_k$. Under the abovementioned assumptions on the observation noise, it is reasonable to utilize the instantaneous error function $e$ of the following form:

$$e(\tilde{y}, z, W) = \frac{1}{2}\,\bigl(\tilde{y} - G(z, W)\bigr)^{\mathsf{T}}\, \Omega\, \bigl(\tilde{y} - G(z, W)\bigr), \qquad (2.85)$$

where $\Omega = \operatorname{diag}(\omega_1, \ldots, \omega_{n_y})$ is the diagonal matrix of error weights, usually taken inversely proportional to the corresponding variances of the measurement noise.
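In code, the instantaneous error (2.85) and the trajectory error (2.84) might look as follows; this is a sketch under the assumption that the model outputs $G(z(t_k), W)$ have already been computed, and all names are illustrative.

```python
import numpy as np

def instantaneous_error(y_tilde_k, y_pred_k, omega):
    """Weighted quadratic error e of eq. (2.85); y_pred_k plays the
    role of G(z(t_k), W) and omega holds the diagonal of Omega."""
    r = y_tilde_k - y_pred_k
    return 0.5 * r @ (omega * r)

def trajectory_error(y_tilde, y_pred, omega):
    """Trajectory error E^(p)(W) of eq. (2.84): sum of the
    instantaneous errors over time steps k = 1, ..., K."""
    return sum(
        instantaneous_error(y_tilde[k], y_pred[k], omega)
        for k in range(1, len(y_tilde))
    )

# Error weights taken inversely proportional to the noise variances.
sigma = np.array([0.05, 0.1])
omega = 1.0 / sigma**2
```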
We need to minimize the total prediction error $\bar{E}$ with respect to the neural network parameters $W$. Again, the minimization can be carried out using any of the optimization methods described in Section 2.2.1, provided we can compute the gradient and Hessian of the error function with respect to the parameters. Just as in the case of static neural networks, the total error gradient $\nabla \bar{E}$ and Hessian $\nabla^2 \bar{E}$ may be expressed in terms of the individual error gradients $\nabla E^{(p)}$ and Hessians $\nabla^2 E^{(p)}$. Thus, we describe the algorithms for computation of the derivatives for $E^{(p)}$ and omit the trajectory index $p$ in what follows.
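Written out explicitly, and assuming (as the per-trajectory treatment above suggests) that the total error is the sum of the individual trajectory errors, the decomposition reads

$$\bar{E}(W) = \sum_{p=1}^{P} E^{(p)}(W), \qquad \nabla \bar{E} = \sum_{p=1}^{P} \nabla E^{(p)}, \qquad \nabla^2 \bar{E} = \sum_{p=1}^{P} \nabla^2 E^{(p)}.$$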
Again, we have two different computation modes, forward-in-time and reverse-in-time, each with its own advantages and disadvantages. The forward-in-time approach theoretically allows one to work with trajectories of infinite duration, i.e., to perform online adaptation as new data arrive. In practice, however, each of its iterations is more computationally expensive than in the reverse-in-time approach. The reverse-in-time approach is applicable only when the whole training set is available beforehand, but it works significantly faster.
Backpropagation through time (BPTT) algorithm [67–69] for the error function gradient. First, we perform a forward pass to compute the predicted states $z(t_k)$ for all time steps $t_k$, $k = 1, \ldots, K$, according to equations (2.13). We also compute the error $E(W)$ according to (2.84) and (2.85).
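A minimal sketch of this forward pass, assuming the state space model (2.13) is available as a state transition map F and an output map G (these callables and their signatures are our assumptions for illustration):

```python
import numpy as np

def forward_pass(F, G, z0, u, y_tilde, omega, W):
    """Forward pass of BPTT: propagate the model states z(t_k)
    through the state space model and accumulate the error E(W)
    of eqs. (2.84)-(2.85).

    F(z, u, W) -> next state z(t_{k+1})    (state equation of (2.13))
    G(z, W)    -> predicted output y(t_k)  (output equation of (2.13))
    """
    K = len(u) - 1
    z = [z0]
    E = 0.0
    for k in range(1, K + 1):
        z.append(F(z[k - 1], u[k - 1], W))  # predicted state z(t_k)
        r = y_tilde[k] - G(z[k], W)         # output residual
        E += 0.5 * r @ (omega * r)          # instantaneous error (2.85)
    return z, E  # the states are kept for the subsequent backward pass
```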
We define the error function sensitivities with respect to the model states at time step $t_k$ to be as