If the largest (in absolute value) eigenvalues of ∂F(z(t_r), u(t_r), W)/∂z are less than 1 for all time steps t_r, r = l, ..., k − 1, then the norm of the sensitivity ∂z(t_k)/∂z(t_l) will decay exponentially with k − l. Hence, the terms of the error gradient which correspond to recent time steps will dominate the sum. This is the reason why gradient-based optimization methods learn short-term dependencies much faster than the long-term ones. On the other hand, a gradient explosion (exponential growth of its norm) corresponds to a situation when the eigenvalues exceed 1 at all time steps. The gradient explosion effect might lead to divergence of the optimization method, unless care is taken.
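As a quick numerical illustration of this mechanism (a minimal sketch, not from the book: the state dimension, the horizon, and the use of a single fixed random Jacobian, i.e., a linearization about an equilibrium, are assumptions made for the example), the sensitivity norm shrinks exponentially with the horizon when the largest eigenvalue magnitude of the per-step Jacobian stays below 1, and blows up when it exceeds 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8          # state dimension (illustrative)
steps = 100    # horizon k - l

for target_radius in (0.9, 1.1):           # largest |eigenvalue| below / above 1
    J = rng.standard_normal((n, n))
    J *= target_radius / max(abs(np.linalg.eigvals(J)))   # rescale spectral radius
    sens = np.eye(n)                        # sensitivity d z(t_k) / d z(t_l)
    for _ in range(steps):
        sens = J @ sens                     # chain rule across one more time step
    print(f"spectral radius {target_radius}: "
          f"||d z(t_k)/d z(t_l)|| after {steps} steps = {np.linalg.norm(sens, 2):.3e}")
```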
In particular, if the mapping F is represented by a layered feedforward neural network (2.8), then the Jacobian ∂F(z(t_r), u(t_r), W)/∂z corresponds to the derivatives of the network outputs with respect to its inputs, i.e.,

\[
\frac{\partial a^L}{\partial a^0} = \operatorname{diag}\bigl[\dot{\varphi}^L(n^L)\bigr]\,\omega^L \cdots \operatorname{diag}\bigl[\dot{\varphi}^1(n^1)\bigr]\,\omega^1. \tag{2.98}
\]

Assume that the derivatives of all the activation functions φ^l are bounded by some constant η^l. Denote by λ^l_max the eigenvalue with the largest magnitude for the weight matrix ω^l of the lth layer. If the inequality ∏_{l=1}^{L} λ^l_max η^l < 1 holds, then the largest (in magnitude) eigenvalue of the Jacobian matrix ∂a^L/∂a^0 is less than one. Derivatives of the hyperbolic tangent activation function, as well as of the identity activation function, are bounded by 1.
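A short numerical check of (2.98) (an illustrative sketch, not from the book: the layer width, depth, tanh activations, and random square weight matrices are assumptions) builds the Jacobian from the layer-by-layer product, cross-checks it against finite differences, and reports the per-layer product ∏_l λ^l_max η^l with η^l = 1 for the hyperbolic tangent:

```python
import numpy as np

rng = np.random.default_rng(1)
n, L = 5, 3                                      # layer width and depth (assumed for the example)
omegas = [0.3 * rng.standard_normal((n, n)) for _ in range(L)]   # square weight matrices omega^l

def forward(a0):
    """Layered tanh network: a^l = tanh(omega^l a^(l-1)); returns the output a^L."""
    a = a0
    for omega in omegas:
        a = np.tanh(omega @ a)
    return a

def jacobian_298(a0):
    """Right-hand side of (2.98): diag(phi_dot^L(n^L)) omega^L ... diag(phi_dot^1(n^1)) omega^1."""
    a, jac = a0, np.eye(n)
    for omega in omegas:
        a = np.tanh(omega @ a)                   # activations a^l
        jac = np.diag(1.0 - a**2) @ omega @ jac  # tanh'(n^l) = 1 - tanh(n^l)^2
    return jac

a0 = rng.standard_normal(n)
J = jacobian_298(a0)

# Cross-check (2.98) against a central finite-difference Jacobian of the same network.
eps = 1e-6
J_fd = np.column_stack([(forward(a0 + eps * e) - forward(a0 - eps * e)) / (2 * eps)
                        for e in np.eye(n)])
print("max |(2.98) - finite differences| :", np.max(np.abs(J - J_fd)))
print("largest |eigenvalue| of da^L/da^0 :", max(abs(np.linalg.eigvals(J))))
print("prod_l lambda^l_max (eta^l = 1)   :", np.prod([max(abs(np.linalg.eigvals(w))) for w in omegas]))
```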
One of the possibilities to speed up the training is to use second-order optimization methods [59,74]. Another option would be to utilize the Long Short-Term Memory (LSTM) models [72,75–80], specially designed to overcome the vanishing gradient effect by using special memory cells instead of context neurons. LSTM networks have been successfully applied in speech recognition, machine translation, and anomaly detection. However, little attention has been paid to applications of LSTM to dynamical system modeling problems [81].
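For concreteness, the following is a minimal sketch of what such an application could look like (not the book's model: the architecture, layer sizes, and the one-step-ahead, series-parallel training scheme are assumptions chosen for illustration), using an LSTM that maps the measured state z(t_k) and control u(t_k) to a prediction of z(t_{k+1}):

```python
import torch
import torch.nn as nn

class LSTMDynamicsModel(nn.Module):
    """Sketch of a one-step-ahead predictor z(t_{k+1}) ~ F(z(t_k), u(t_k)) built on LSTM memory cells."""

    def __init__(self, state_dim, control_dim, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(state_dim + control_dim, hidden_size, batch_first=True)
        self.readout = nn.Linear(hidden_size, state_dim)

    def forward(self, z_seq, u_seq):
        # z_seq, u_seq: (batch, T, dim) measured states and controls at t_1 .. t_T.
        features, _ = self.lstm(torch.cat([z_seq, u_seq], dim=-1))
        return self.readout(features)            # predictions of the state one step ahead

# Illustrative usage with random tensors (all sizes are assumptions, not from the book).
model = LSTMDynamicsModel(state_dim=3, control_dim=1)
z = torch.randn(8, 20, 3)        # batch of 8 trajectories, 20 time steps
u = torch.randn(8, 20, 1)
z_next_pred = model(z, u)        # shape (8, 20, 3)
loss = nn.functional.mse_loss(z_next_pred[:, :-1], z[:, 1:])   # prediction at t_k vs. measured z(t_{k+1})
print(loss.item())
```

The LSTM cell state plays the role of the context neurons of a conventional recurrent network, but its additive update allows gradients to propagate over long horizons without vanishing.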
2. Bifurcations of the recurrent neural network dynamics [82–84]. Since the recurrent neural network is a dynamical system itself, its phase portrait might undergo qualitative changes during the training. If these changes affect the actual predicted trajectories, this might lead to significant changes of the error in response to small changes of the parameters (i.e., the gradient norm becomes very large), provided the duration of these trajectories is large enough.

In order to guarantee a complete absence of bifurcations during the network training, we would need a very good initial guess for its parameters, so that the model would already possess the desired asymptotic behavior. Since this assumption is very unrealistic, it seems more reasonable to modify the optimization methods in order to enforce their stability.

3. Spurious valleys in the error surface [85–87]. These valleys are called spurious due to the fact that they do not depend on the desired values of the outputs ỹ(t_k). The location of these valleys is determined only by the initial conditions z(t_0) and the controls u(t_k). Reasons for the occurrence of such valleys have been investigated in some special cases. For example, if the initial state z(t_0) of (2.13) is a global repeller within some area of the parameter space, then an infinitesimal control u(t_k) causes the model states z(t_k) to tend to infinity, which in turn leads to an unbounded error growth. Now assume that this area of the parameter space contains a line along which the connection weights between the controls u(t_k) and the neurons of F are identically zero, that is, the recurrent neural network (2.13) does not depend on the controls. Parameters along this