
If the largest (in absolute value) eigenvalues of $\partial F(z(t_r),u(t_r),W)/\partial z$ are less than 1 for all time steps $t_r$, $r = l,\ldots,k-1$, then the norm of the sensitivity $\partial z(t_k)/\partial z(t_l)$ will decay exponentially with $k-l$. Hence, the terms of the error gradient which correspond to recent time steps will dominate the sum. This is the reason why gradient-based optimization methods learn short-term dependencies much faster than the long-term ones. On the other hand, a gradient explosion (exponential growth of its norm) corresponds to a situation when the eigenvalues exceed 1 at all time steps. The gradient explosion effect might lead to divergence of the optimization method, unless care is taken.
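This mechanism can be reproduced with a minimal NumPy sketch (an illustration added here, not code from the book), assuming for simplicity that every step Jacobian $\partial F/\partial z$ equals the same random matrix rescaled to a prescribed spectral radius; the norm of the sensitivity $\partial z(t_k)/\partial z(t_l)$ then either decays or grows exponentially with the number of steps:

```python
import numpy as np

rng = np.random.default_rng(0)

def sensitivity_norms(spectral_radius, n=8, steps=50):
    """Norms of the product of step Jacobians, i.e., of dz(t_k)/dz(t_l),
    for a toy recurrence whose constant Jacobian has the given spectral radius."""
    A = rng.standard_normal((n, n))
    A *= spectral_radius / max(abs(np.linalg.eigvals(A)))  # rescale eigenvalues
    J = np.eye(n)
    norms = []
    for _ in range(steps):
        J = A @ J                         # chain rule: one more step Jacobian
        norms.append(np.linalg.norm(J, 2))
    return norms

print(sensitivity_norms(0.9)[::10])   # decays exponentially -> vanishing gradient
print(sensitivity_norms(1.1)[::10])   # grows exponentially  -> gradient explosion
```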
In particular, if the mapping $F$ is represented by a layered feedforward neural network (2.8), then the Jacobian $\partial F(z(t_r),u(t_r),W)/\partial z$ corresponds to derivatives of network outputs with respect to its inputs, i.e.,

$$\frac{\partial a^{L}}{\partial a^{0}} = \operatorname{diag}\bigl[\dot{\varphi}^{L}(n^{L})\bigr]\,\omega^{L} \cdots \operatorname{diag}\bigl[\dot{\varphi}^{1}(n^{1})\bigr]\,\omega^{1}. \qquad (2.98)$$

Assume that the derivatives of all the activation functions $\varphi^{l}$ are bounded by some constant $\eta^{l}$. Denote by $\lambda^{l}_{\max}$ the eigenvalue with the largest magnitude for the weight matrix $\omega^{l}$ of the $l$th layer. If the inequality $\prod_{l=1}^{L} \lambda^{l}_{\max}\,\eta^{l} < 1$ holds, then the largest (in magnitude) eigenvalue of a Jacobian matrix $\partial a^{L}/\partial a^{0}$ is less than one. Derivatives of the hyperbolic tangent activation function, as well as the identity activation function, are bounded by 1.
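As a quick numerical check of (2.98), the following sketch (again an added illustration, with arbitrary layer sizes and weight scale) assembles the Jacobian of a small tanh network as the product of per-layer factors and compares it with a finite-difference Jacobian; it also reports the largest eigenvalue magnitude together with the quantity $\prod_l \lambda^{l}_{\max}\eta^{l}$ from the text, where $\eta^{l} = 1$ for tanh layers:

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [4, 4, 4]                          # equal widths, so weight matrices are square
Ws = [0.4 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(a0):
    """Forward pass of a small tanh network; returns the output a^L."""
    a = a0
    for W in Ws:
        a = np.tanh(W @ a)
    return a

a0 = rng.standard_normal(sizes[0])

# Jacobian da^L/da^0 assembled as the product in Eq. (2.98).
a, J = a0, np.eye(sizes[0])
for W in Ws:
    n_l = W @ a                            # pre-activations n^l
    a = np.tanh(n_l)                       # activations a^l
    J = np.diag(1.0 - a ** 2) @ W @ J      # diag[phi'(n^l)] omega^l times previous factors

# Finite-difference check of the same Jacobian.
eps = 1e-6
J_fd = np.column_stack([(forward(a0 + eps * e) - forward(a0 - eps * e)) / (2 * eps)
                        for e in np.eye(sizes[0])])

print("max |J - J_fd|            :", np.max(np.abs(J - J_fd)))
print("largest |eigenvalue| of J :", max(abs(np.linalg.eigvals(J))))
print("prod of lambda_max * eta  :", np.prod([max(abs(np.linalg.eigvals(W))) for W in Ws]))
```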
One of the possibilities to speed up the training is to use the second-order optimization methods [59,74]. Another option would be to utilize the Long Short-Term Memory (LSTM) models [72,75–80], specially designed to overcome the vanishing gradient effect by using special memory cells instead of context neurons. LSTM networks have been successfully applied in speech recognition, machine translation, and anomaly detection. However, little attention has been paid to applications of LSTM to dynamical system modeling problems [81].
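For reference, a single LSTM cell in its standard form (with input, forget, and output gates) can be sketched as below; this is a generic NumPy illustration rather than code from the cited works. The additive update of the memory cell $c$ is what allows gradients to propagate over long horizons:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, p):
    """One step of a standard LSTM cell.
    x: input, h: hidden state, c: memory cell, p: dict of weights/biases."""
    z = np.concatenate([x, h])
    f = sigmoid(p["Wf"] @ z + p["bf"])      # forget gate
    i = sigmoid(p["Wi"] @ z + p["bi"])      # input gate
    o = sigmoid(p["Wo"] @ z + p["bo"])      # output gate
    g = np.tanh(p["Wg"] @ z + p["bg"])      # candidate memory update
    c_new = f * c + i * g                   # additive memory update (eases gradient flow)
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Tiny usage example with random parameters (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
nx, nh = 3, 5
p = {k: 0.1 * rng.standard_normal((nh, nx + nh)) for k in ("Wf", "Wi", "Wo", "Wg")}
p.update({k: np.zeros(nh) for k in ("bf", "bi", "bo", "bg")})
h, c = np.zeros(nh), np.zeros(nh)
for t in range(10):
    h, c = lstm_step(rng.standard_normal(nx), h, c, p)
```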
2. Bifurcations of recurrent neural network dynamics [82–84]. Since the recurrent neural network is a dynamical system itself, its phase portrait might undergo qualitative changes during the training. If these changes affect the actual predicted trajectories, this might lead to significant changes of the error in response to small changes of parameters (i.e., the gradient norm becomes very large), provided the duration of these trajectories is large enough.
In order to guarantee a complete absence of bifurcations during the network training, we would need a very good initial guess for its parameters, so that the model would already possess the desired asymptotic behavior. Since this assumption is very unrealistic, it seems more reasonable to modify the optimization methods in order to enforce their stability.
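This effect can be illustrated by a deliberately simple scalar example (added here for illustration, not taken from the cited references): the recurrent unit $z(t+1) = \tanh(w\,z(t))$ undergoes a bifurcation at $w = 1$, where the origin loses stability and a pair of new equilibria appears. Sweeping $w$ through this point and estimating the sensitivity of the long-horizon state by finite differences shows the sensitivity (and hence the gradient of a long-horizon error) peaking sharply near the bifurcation:

```python
import numpy as np

def final_state(w, steps=5000, z0=0.05):
    """Long-horizon state of the scalar recurrent unit z(t+1) = tanh(w * z(t))."""
    z = z0
    for _ in range(steps):
        z = np.tanh(w * z)
    return z

# Sweep the recurrent weight through the bifurcation at w = 1.
for w in (0.95, 0.99, 1.00, 1.01, 1.05, 1.10):
    dw = 1e-3
    sens = (final_state(w + dw) - final_state(w - dw)) / (2 * dw)  # finite-difference sensitivity
    print(f"w = {w:.2f}   z_final = {final_state(w):+.4f}   d z_final / dw ~ {sens:+.3e}")
```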
                            models [72,75–80] specially designed to over-  the neurons of F are identically zero, that is,
                            come the vanishing gradient effect by using  the recurrent neural network (2.13) does not
                            the special memory cells instead of context  depend on controls. Parameters along this