it has decreased or not. If it has decreased, then we accept this trial step, and we increase the step length. Otherwise, we reject the trial step, and decrease the step length.
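This accept/reject heuristic can be summarized in a few lines of code. The following is a minimal sketch, assuming hypothetical callables E and grad_E for the error function and its gradient; the growth and shrink factors are illustrative choices, not values prescribed here.

import numpy as np

def gd_adaptive_step(E, grad_E, w, alpha=1e-2, grow=1.2, shrink=0.5, n_iter=100):
    # Gradient descent with the trial-step heuristic: accept a step that
    # reduces the error and enlarge the step length, otherwise reject it
    # and shrink the step length. E, grad_E, grow, shrink are assumed names.
    for _ in range(n_iter):
        p = -grad_E(w)                 # steepest-descent search direction
        w_trial = w + alpha * p        # trial step
        if E(w_trial) < E(w):          # error decreased: accept, increase step
            w = w_trial
            alpha *= grow
        else:                          # error did not decrease: reject, decrease step
            alpha *= shrink
    return w

# Example usage on a hypothetical quadratic error E(w) = w^T w:
w_opt = gd_adaptive_step(lambda w: w @ w, lambda w: 2 * w, np.array([3.0, -2.0]))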
An alternative approach is to perform a line search for an optimal step length which provides the maximum possible reduction of the error function along the search direction, i.e.,

\alpha^{(k)} = \arg\min_{\alpha > 0} E(\bar{W}^{(k)} + \alpha p^{(k)}).   (2.32)

The GD method combined with this exact line search is called the steepest gradient descent.
Note that the global minimum of this univariate function is hard to find; in fact, even a search for an accurate estimate of a local minimum would require many iterations. Fortunately, we do not need to find an exact minimum along the specified direction: the convergence of the overall minimization procedure may be obtained if we guarantee a sufficient decrease of the error function at each iteration. If the search direction is a descent direction and the step lengths satisfy the Wolfe conditions

E(\bar{W}^{(k)} + \alpha^{(k)} p^{(k)}) \le E(\bar{W}^{(k)}) + c_1 \alpha^{(k)} \nabla E(\bar{W}^{(k)})^T p^{(k)},
\nabla E(\bar{W}^{(k)} + \alpha^{(k)} p^{(k)})^T p^{(k)} \ge c_2 \nabla E(\bar{W}^{(k)})^T p^{(k)},   (2.33)

for 0 < c_1 < c_2 < 1, then the iterations converge to a stationary point, \lim_{k \to \infty} \nabla E(\bar{W}^{(k)}) = 0, from an arbitrary initial guess (i.e., we have global convergence to a stationary point). Note that there always exist intervals of step lengths which satisfy the Wolfe conditions. This justifies the use of inexact line search methods, which require fewer iterations to find an appropriate step length providing a sufficient reduction of the error function. Unfortunately, the GD method has a linear convergence rate, which is very slow.
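As an illustration, the Wolfe conditions (2.33) are straightforward to check numerically. Below is a minimal sketch, assuming the same hypothetical E and grad_E callables as above; the values c1 = 1e-4 and c2 = 0.9 are conventional illustrative defaults, and a practical implementation would use a bracketing ("zoom") line search rather than simple step halving.

def wolfe_ok(E, grad_E, w, p, alpha, c1=1e-4, c2=0.9):
    # Check the Wolfe conditions (2.33) for a trial step length alpha
    # along the search direction p, with 0 < c1 < c2 < 1.
    g0 = grad_E(w)
    sufficient_decrease = E(w + alpha * p) <= E(w) + c1 * alpha * (g0 @ p)
    curvature = grad_E(w + alpha * p) @ p >= c2 * (g0 @ p)
    return sufficient_decrease and curvature

def line_search_wolfe(E, grad_E, w, p, alpha=1.0, max_trials=30):
    # Crude inexact line search: shrink alpha until both conditions hold.
    # Simple halving mainly enforces the sufficient-decrease condition;
    # bracketing procedures are used in practice to also guarantee curvature.
    for _ in range(max_trials):
        if wolfe_ok(E, grad_E, w, p, alpha):
            return alpha
        alpha *= 0.5
    return alpha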
Another important first-order method is the nonlinear conjugate gradient (CG) method. In fact, it is a family of methods which utilize search directions of the following general form:

p^{(0)} = -\nabla E(\bar{W}^{(0)}),
p^{(k)} = -\nabla E(\bar{W}^{(k)}) + \beta^{(k)} p^{(k-1)}.   (2.34)

Depending on the choice of the scalar β^(k), we obtain several variations of the method. The most popular expressions for β^(k) are the following:

• the Fletcher–Reeves method:

\beta^{(k)} = \frac{\nabla E(\bar{W}^{(k)})^T \nabla E(\bar{W}^{(k)})}{\nabla E(\bar{W}^{(k-1)})^T \nabla E(\bar{W}^{(k-1)})};   (2.35)

• the Polak–Ribière method:

\beta^{(k)} = \frac{\nabla E(\bar{W}^{(k)})^T (\nabla E(\bar{W}^{(k)}) - \nabla E(\bar{W}^{(k-1)}))}{\nabla E(\bar{W}^{(k-1)})^T \nabla E(\bar{W}^{(k-1)})};   (2.36)

• the Hestenes–Stiefel method:

\beta^{(k)} = \frac{\nabla E(\bar{W}^{(k)})^T (\nabla E(\bar{W}^{(k)}) - \nabla E(\bar{W}^{(k-1)}))}{(\nabla E(\bar{W}^{(k)}) - \nabla E(\bar{W}^{(k-1)}))^T p^{(k-1)}}.   (2.37)
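The CG recursion (2.34) with any of the β^(k) formulas (2.35)–(2.37) can be sketched compactly. The code below is a minimal illustration, assuming the hypothetical E, grad_E, and line_search_wolfe from the sketches above; restart logic and other safeguards are omitted.

import numpy as np

def nonlinear_cg(E, grad_E, w0, beta_rule="FR", n_iter=200, tol=1e-8):
    # Nonlinear conjugate gradient following (2.34); beta_rule selects
    # among the Fletcher-Reeves, Polak-Ribiere, and Hestenes-Stiefel formulas.
    w = np.asarray(w0, dtype=float).copy()
    g = grad_E(w)
    p = -g                                  # p(0): negative gradient direction
    for _ in range(n_iter):
        alpha = line_search_wolfe(E, grad_E, w, p)
        w_new = w + alpha * p
        g_new = grad_E(w_new)
        if np.linalg.norm(g_new) < tol:     # stationary point reached
            return w_new
        if beta_rule == "FR":               # Fletcher-Reeves (2.35)
            beta = (g_new @ g_new) / (g @ g)
        elif beta_rule == "PR":             # Polak-Ribiere (2.36)
            beta = (g_new @ (g_new - g)) / (g @ g)
        else:                               # Hestenes-Stiefel (2.37)
            beta = (g_new @ (g_new - g)) / ((g_new - g) @ p)
        p = -g_new + beta * p               # new search direction (2.34)
        w, g = w_new, g_new
    return w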
Irrespective of the particular β^(k) selected, the first search direction p^(0) is simply the negative gradient direction. If we assume that the error function is convex and quadratic (2.31), then the method generates a sequence of conjugate search directions (i.e., p^{(i) T} A p^{(j)} = 0 for i ≠ j). If we also assume that the line searches are exact, then the method converges within n_w iterations. In the general case of a nonlinear error function, the convergence rate is linear; however, a twice differentiable error function with a nonsingular Hessian is approximately quadratic in the neighborhood of the solution, which results in fast convergence. Note also that the search directions lose conjugacy, hence we need to perform