2.2 ARTIFICIAL NEURAL NETWORK TRAINING METHODS
Minimization is carried out by means of various iterative numerical methods. The optimization methods can be divided into global and local ones, according to the type of minimum they seek. Global optimization methods seek an approximate global minimum, whereas local methods seek a precise local minimum. Most of the global optimization methods are stochastic in nature (e.g., simulated annealing, evolutionary algorithms, particle swarm optimization), and their convergence is achieved almost surely and only in the limit. In this book we focus on the local deterministic gradient-based optimization methods, which guarantee rapid convergence to a local solution under some reasonable assumptions. In order to apply these methods, we also require the error function to be sufficiently smooth (which is usually the case with neural networks, provided all the activation functions are smooth). For more detailed information on local optimization methods, we refer to [49–52]. Metaheuristic global optimization methods are covered in [53,54].
Note that the local optimization methods require an initial guess $W^{(0)}$ for parameter values. There are various approaches to the initialization of network parameters. For example, the parameters may be sampled from a Gaussian distribution, i.e.,

\[
W_i \sim N(0, 1), \quad i = 1, \ldots, n_w. \tag{2.27}
\]
The following alternative initialization method for layered feedforward neural networks (2.8), called Xavier initialization, was suggested in [55]:

\[
b_i^l = 0, \qquad
w_{i,j}^l \sim U\!\left( -\sqrt{\frac{6}{S_{l-1} + S_l}},\; \sqrt{\frac{6}{S_{l-1} + S_l}} \right). \tag{2.28}
\]
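As an illustration of the two initialization schemes (2.27) and (2.28), a minimal NumPy sketch is given below; the dense per-layer weight layout, the layer sizes, and the function names are assumptions made for this example and do not come from the book.

```python
import numpy as np

def init_gaussian(layer_sizes, seed=0):
    """Sample every weight and bias from N(0, 1), as in Eq. (2.27)."""
    rng = np.random.default_rng(seed)
    params = []
    for s_prev, s_cur in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((s_cur, s_prev))   # weights of layer l
        b = rng.standard_normal(s_cur)             # biases of layer l
        params.append((W, b))
    return params

def init_xavier(layer_sizes, seed=0):
    """Xavier initialization, Eq. (2.28): zero biases and weights drawn
    uniformly from [-sqrt(6/(S_{l-1}+S_l)), +sqrt(6/(S_{l-1}+S_l))]."""
    rng = np.random.default_rng(seed)
    params = []
    for s_prev, s_cur in zip(layer_sizes[:-1], layer_sizes[1:]):
        limit = np.sqrt(6.0 / (s_prev + s_cur))
        W = rng.uniform(-limit, limit, size=(s_cur, s_prev))
        b = np.zeros(s_cur)
        params.append((W, b))
    return params

# Example: S_0 = 4 inputs, S_1 = 10 hidden neurons, S_2 = 1 output.
params = init_xavier([4, 10, 1])
```

The Xavier bounds scale with $S_{l-1} + S_l$ so that the variance of the propagated signals and of the back-propagated gradients stays roughly constant across layers.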
Optimization methods may also be classified by the order of the error function derivatives used to guide the search process. Thus, zero-order methods use only the error function values; first-order methods rely on the first derivatives (gradient $\nabla \bar{E}$); second-order methods also utilize the second derivatives (Hessian $\nabla^2 \bar{E}$).
The basic descent method has the form

\[
W^{(k+1)} = W^{(k)} + \alpha^{(k)} p^{(k)}, \qquad \bar{E}(W^{(k+1)}) < \bar{E}(W^{(k)}), \tag{2.29}
\]

where $p^{(k)}$ is a search direction and $\alpha^{(k)}$ represents a step length, also called the learning rate. Note that we require each step to decrease the error function. In order to guarantee the error function decrease for arbitrarily small step lengths, we need the search direction to be a descent direction, that is, to satisfy $(p^{(k)})^T \nabla \bar{E}(W^{(k)}) < 0$.
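As a concrete check of this condition, the short sketch below evaluates $p^T \nabla \bar{E}(W)$ for two candidate directions at a sample point; the quadratic error function, its coefficients, and the chosen directions are assumptions made purely for this illustration.

```python
import numpy as np

# A sample error function: a quadratic of the form (2.31) with assumed
# coefficients, standing in for the network error E(W).
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])
grad = lambda w: A @ w + b               # gradient of the quadratic

w = np.array([2.0, -3.0])                # current parameter vector
g = grad(w)

p_neg_grad = -g                          # the negative gradient
p_fixed = np.array([1.0, 0.0])           # an arbitrary fixed direction

print(p_neg_grad @ g)   # negative, so a descent direction at w
print(p_fixed @ g)      # positive here (4.0), so not a descent direction at w
```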
The simplest example of a first-order descent method is the gradient descent (GD) method, which utilizes the negative gradient search direction, i.e.,

\[
p^{(k)} = -\nabla \bar{E}(W^{(k)}). \tag{2.30}
\]
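Below is a minimal sketch of the descent iteration (2.29) combined with the gradient descent direction (2.30), using a fixed step length and accepting a step only when it decreases the error. The toy linear-neuron least-squares error, the synthetic data, and the function names are assumptions introduced for this illustration rather than code from the book.

```python
import numpy as np

def gradient_descent(error, grad, w0, alpha=0.1, n_iter=2000):
    """Descent iteration (2.29) with the gradient descent direction (2.30).
    A step is kept only if it decreases the error, as required by (2.29)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        p = -grad(w)                   # search direction, Eq. (2.30)
        w_new = w + alpha * p          # candidate step, Eq. (2.29)
        if error(w_new) >= error(w):   # reject steps that fail to decrease the error
            break
        w = w_new
    return w

# Toy problem: least-squares error of a single linear neuron y = w^T x.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
E = lambda w: 0.5 * np.mean((X @ w - y) ** 2)
dE = lambda w: X.T @ (X @ w - y) / len(y)

w_fit = gradient_descent(E, dE, w0=np.zeros(3), alpha=0.1)
```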
The step lengths may be assigned beforehand, i.e., $\alpha^{(k)} \equiv \alpha$ for all $k$, but if the step $\alpha$ is too large, the error function might actually increase, and then the iterations would diverge. For example, in the case of a convex quadratic error function of the form

\[
\bar{E}(W) = \frac{1}{2} W^T A W + b^T W + c, \tag{2.31}
\]

where $A$ is a symmetric positive definite matrix with a maximum eigenvalue of $\lambda_{\max}$, the step length must satisfy

\[
\alpha < \frac{2}{\lambda_{\max}}
\]

in order to guarantee the convergence of the gradient descent iterations.
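The role of this bound can be checked numerically. The sketch below runs fixed-step gradient descent on a two-dimensional quadratic of the form (2.31) with $A = \mathrm{diag}(1, 10)$, so that $\lambda_{\max} = 10$ and the critical step length is $0.2$; the matrix and the step lengths are assumptions chosen only for this demonstration.

```python
import numpy as np

# Quadratic error of the form (2.31) with A = diag(1, 10), b = 0, c = 0,
# so lambda_max = 10 and the critical step length is 2 / lambda_max = 0.2.
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w

def run_gd(alpha, n_iter=200):
    """Run fixed-step gradient descent and return the final distance to the minimum."""
    w = np.array([1.0, 1.0])
    for _ in range(n_iter):
        w = w - alpha * grad(w)
    return np.linalg.norm(w)

print(run_gd(alpha=0.19))   # below the bound: the iterates shrink toward the minimum
print(run_gd(alpha=0.21))   # above the bound: the iterates blow up
```

With $\alpha = 0.19$ every eigencomponent of the iterate is contracted, whereas with $\alpha = 0.21$ the component along $\lambda_{\max}$ is multiplied by $|1 - \alpha\lambda_{\max}| > 1$ at each step and the iterations diverge.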
On the other hand, a small step $\alpha$ would result in a slow convergence. In order to circumvent this problem, we can perform a step length adaptation: we take a “trial” step, evaluate the error function, and check whether its value has decreased.