We should also mention that in the case of the batch or minibatch update strategy, the computation of the total error function value, as well as its derivatives, can be efficiently parallelized. In order to do that, we need to divide the data set into multiple subsets, compute partial sums of the error function and its derivatives over the training examples of each subset in parallel, and then sum the results. This is not possible in the case of stochastic updates. In the case of an SGD method, we can parallelize the gradient computations over the neurons of each layer.
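As an illustration of this idea, the following Python sketch splits the data set into subsets, evaluates the partial error and gradient sums for each subset independently, and then adds the results. The linear toy model, the function names, and the use of a thread pool are assumptions made only to keep the example self-contained; in practice the partial sums would typically be computed by separate processes or devices.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical model: a single linear layer y_hat = W @ x, used only to
# illustrate the partial-sum structure of the batch error and gradient.
def chunk_error_and_grad(W, X_chunk, Y_chunk):
    """Partial sums of the squared error and its gradient over one subset."""
    E = 0.0
    G = np.zeros_like(W)
    for x, y in zip(X_chunk, Y_chunk):
        r = W @ x - y              # residual for one training example
        E += 0.5 * r @ r           # contribution to the error sum
        G += np.outer(r, x)        # contribution to the gradient sum
    return E, G

def batch_error_and_grad(W, X, Y, n_workers=4):
    """Total error and gradient obtained by summing independent partial results."""
    X_chunks = np.array_split(X, n_workers)
    Y_chunks = np.array_split(Y, n_workers)
    # Threads are used here only for brevity; each partial sum is independent
    # and could equally be assigned to a separate process or device.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda args: chunk_error_and_grad(W, *args),
                              zip(X_chunks, Y_chunks)))
    E_total = sum(E for E, _ in parts)
    G_total = sum(G for _, G in parts)
    return E_total, G_total

# Usage with random toy data.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(1000, 3)), rng.normal(size=(1000, 2))
W = rng.normal(size=(2, 3))
E, G = batch_error_and_grad(W, X, Y)
```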
Finally, we note that any iterative method requires a stopping criterion used to terminate the procedure. One simple option is a test based on the first-order necessary conditions for a local minimum, i.e.,

‖∇Ē(W^(k))‖ < ε_g.  (2.53)

We can also terminate the iterations if it seems that no progress is made, i.e.,

Ē(W^(k)) − Ē(W^(k+1)) < ε_E,
‖W^(k) − W^(k+1)‖ < ε_w.  (2.54)

In order to prevent an infinite loop in the case of algorithm divergence, we might stop when a certain maximum number of iterations k̄ has been performed, i.e.,

k ≤ k̄.  (2.55)
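A minimal sketch of how these stopping tests might be combined in a training loop is given below. The plain gradient descent update, the function names, and the tolerance values are illustrative assumptions, not the specific algorithms discussed in this chapter.

```python
import numpy as np

def train(grad_fn, error_fn, W0, lr=1e-2,
          eps_g=1e-5, eps_E=1e-9, eps_w=1e-9, k_max=10_000):
    """Gradient descent with the stopping tests (2.53)-(2.55).

    grad_fn(W)  -> gradient of the total error at W
    error_fn(W) -> total error at W
    (Both are assumed to be supplied by the user.)
    """
    W, E = W0.copy(), error_fn(W0)
    for k in range(k_max):                      # test (2.55): k <= k_max
        g = grad_fn(W)
        if np.linalg.norm(g) < eps_g:           # test (2.53): small gradient norm
            return W, "first-order optimality"
        W_new = W - lr * g                      # one iteration of the method
        E_new = error_fn(W_new)
        if (E - E_new < eps_E or                # tests (2.54): negligible progress
                np.linalg.norm(W - W_new) < eps_w):
            return W_new, "no further progress"
        W, E = W_new, E_new
    return W, "iteration limit reached"

# Example: minimize the simple quadratic error 0.5 * ||W||^2.
W_opt, reason = train(grad_fn=lambda W: W,
                      error_fn=lambda W: 0.5 * float(W @ W),
                      W0=np.array([1.0, -2.0]))
```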
2.2.2 Static Neural Network Training

In this subsection, we consider the function approximation problem. The problem is stated as follows. Suppose that we wish to approximate an unknown mapping f: X → Y, where X ⊂ R^{n_x} and Y ⊂ R^{n_y}. Assume we are given an experimental data set of the form

{x^(p), ỹ^(p)}_{p=1}^{P},  (2.56)

where x^(p) ∈ X represent the input vectors and ỹ^(p) ∈ Y represent the observed output vectors. Note that in general the observed outputs ỹ^(p) do not match the true outputs y^(p) = f(x^(p)). We assume that the observations are corrupted by additive Gaussian noise, i.e.,

ỹ^(p) = y^(p) + η^(p),  (2.57)

where η^(p) represent the sample points of a zero-mean random vector η ∼ N(0, Σ) with diagonal covariance matrix

Σ = diag(σ_1^2, …, σ_{n_y}^2).
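The following sketch shows how such a data set could be generated synthetically according to the observation model (2.56), (2.57); the toy mapping f, the sample size, and the noise levels are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true mapping f: R^2 -> R^2, used only for illustration.
def f(x):
    return np.array([np.sin(x[0]) + x[1], x[0] * x[1]])

P = 200                                  # number of training examples
sigma = np.array([0.05, 0.10])           # noise std of each output component
X = rng.uniform(-1.0, 1.0, size=(P, 2))  # inputs x^(p)
Y_true = np.array([f(x) for x in X])     # true outputs y^(p) = f(x^(p))
# Observed outputs y~^(p) = y^(p) + eta^(p), eta ~ N(0, diag(sigma^2))  (2.57)
Y_obs = Y_true + rng.normal(0.0, sigma, size=Y_true.shape)
```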
The approximation is to be performed using a layered feedforward neural network of the form (2.8). Under the abovementioned assumptions on the observation noise, it is reasonable to utilize a least-squares error function. Thus, we have a total error function Ē of the form (2.25) with the individual errors

E^(p)(W) = (1/2) (ỹ^(p) − ŷ^(p))^T Ω (ỹ^(p) − ŷ^(p)),  (2.58)

where ŷ^(p) represent the neural network outputs given the corresponding inputs x^(p) and weights W. The diagonal matrix Ω of fixed “error weights” has the form

Ω = diag(ω_1, …, ω_{n_y}),

where ω_i are usually taken to be inversely proportional to noise variances.
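A possible implementation of the individual error (2.58) with a diagonal weight matrix is sketched below. It assumes, for simplicity, that the total error is taken as the plain sum of the individual errors over the data set; the output values used in the example are arbitrary.

```python
import numpy as np

def example_error(y_obs, y_hat, omega):
    """Individual error E^(p)(W) = 0.5 * (y~ - y_hat)^T diag(omega) (y~ - y_hat)."""
    e = y_obs - y_hat
    return 0.5 * e @ (omega * e)     # diagonal Omega applied componentwise

def total_error(Y_obs, Y_hat, omega):
    """Total error assumed here to be the sum of the individual errors."""
    return sum(example_error(yo, yh, omega) for yo, yh in zip(Y_obs, Y_hat))

# Error weights taken inversely proportional to the noise variances.
sigma = np.array([0.05, 0.10])
omega = 1.0 / sigma**2

Y_obs = np.array([[1.0, 2.0], [0.5, -1.0]])   # observed outputs (toy values)
Y_hat = np.array([[0.9, 2.1], [0.6, -1.2]])   # network outputs (toy values)
E_bar = total_error(Y_obs, Y_hat, omega)
```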
We need to minimize the total approximation error Ē with respect to the neural network parameters W. If activation functions of all the neurons are smooth, then the error function is also