be symmetric and positive definite, and also minimize the distance to the previous estimate H^{(k)} with respect to some norm: H^{(k+1)} = \arg\min_{H} \|H - H^{(k)}\|. One of the most popular variations of quasi-Newton methods is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm:

H^{(k+1)} = \left(I - \rho^{(k)} s^{(k)} (y^{(k)})^T\right) H^{(k)} \left(I - \rho^{(k)} y^{(k)} (s^{(k)})^T\right) + \rho^{(k)} s^{(k)} (s^{(k)})^T, \quad \rho^{(k)} = \frac{1}{(y^{(k)})^T s^{(k)}}.   (2.45)
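For concreteness, the update (2.45) can be sketched in a few lines of NumPy; the function name bfgs_update and the storage of s^{(k)} and y^{(k)} as one-dimensional arrays are our illustrative choices, not notation from the text:

import numpy as np

def bfgs_update(H, s, y):
    # One BFGS update of the inverse Hessian approximation, Eq. (2.45).
    # H : current approximation H^(k), shape (n_w, n_w)
    # s : parameter step s^(k) = W^(k+1) - W^(k)
    # y : gradient difference y^(k) = grad E(W^(k+1)) - grad E(W^(k))
    rho = 1.0 / (y @ s)                  # rho^(k) = 1 / (y^(k)^T s^(k))
    V = np.eye(len(s)) - rho * np.outer(s, y)
    # H^(k+1) = (I - rho s y^T) H^(k) (I - rho y s^T) + rho s s^T
    return V @ H @ V.T + rho * np.outer(s, s)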
The initial guess for the inverse Hessian may be selected in different ways: for example, if H^{(0)} = I, then the first search direction corresponds to that of GD. Note that if H^{(0)} is positive definite and the step length is selected by a line search so as to satisfy the Wolfe conditions (2.33), then all the resulting H^{(k)} will also be positive definite. The BFGS algorithm has a superlinear rate of convergence. We should also mention that the inverse Hessian approximation contains n_w^2 elements, and when the number of parameters n_w is large, it might not fit in memory. In order to circumvent this issue, a limited-memory BFGS (L-BFGS) [56] version of the algorithm has been proposed, which stores only the m \ll n_w most recent vector pairs \{s^{(j)}, y^{(j)}\}_{j=k-m}^{k-1} and uses them to compute the search direction without explicit evaluation of H^{(k)}.
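In practice this search direction is usually obtained with the two-loop recursion; the sketch below assumes the common scaled-identity choice for the initial matrix, and the names lbfgs_direction, s_list, and y_list are ours, chosen for illustration:

import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    # Two-loop recursion: returns p = -H^(k) grad using only the stored
    # pairs (s^(j), y^(j)), without forming H^(k) explicitly.
    if not s_list:                       # no curvature pairs yet: fall back to GD
        return -grad
    q = grad.copy()
    rhos, alphas = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q = q - alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    # scaled-identity initial guess H^(k,0) = gamma * I (a common heuristic)
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    r = gamma * q
    for (s, y), rho, alpha in zip(zip(s_list, y_list),
                                  reversed(rhos), reversed(alphas)):  # oldest to newest
        beta = rho * (y @ r)
        r = r + (alpha - beta) * s
    return -r                            # search direction p^(k) = -H^(k) grad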
Another strategy, alternative to line search methods, is the family of trust region methods [57]. These methods repeatedly construct a local model \bar{M}^{(k)} of the error function, which is assumed to be valid in a neighborhood of the current point W^{(k)}, and minimize it within this neighborhood. If we utilize a second-order Taylor series approximation of the error function as the model

\bar{E}(W^{(k)} + p) \approx \bar{M}^{(k)}(p) = \bar{E}(W^{(k)}) + p^T \nabla \bar{E}(W^{(k)}) + \frac{1}{2} p^T \nabla^2 \bar{E}(W^{(k)}) p   (2.46)

and a ball of radius \Delta^{(k)} as a trust region, we obtain the following constrained quadratic optimization subproblems:

\min_{p} \bar{M}^{(k)}(p) \quad \text{subject to} \quad \|p\| \le \Delta^{(k)}.   (2.47)

The trust region radius is adapted based on the ratio of the actual to the predicted reduction of the error function, i.e.,

\rho^{(k)} = \frac{\bar{E}(W^{(k)}) - \bar{E}(W^{(k)} + p^{(k)})}{\bar{M}(0) - \bar{M}(p^{(k)})}.   (2.48)

If this ratio is close to 1, the trust region radius is increased by some factor; if the ratio is close to 0, the radius is decreased. Also, if \rho^{(k)} is negative or very small, the step is rejected. We could also use ellipsoidal trust regions of the form \|Dp\| \le \Delta^{(k)}, where D is a nonsingular diagonal matrix.
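This adaptation logic can be sketched as follows; the thresholds 0.25 and 0.75, the growth and shrink factors, and the acceptance tolerance eta are common textbook choices rather than values prescribed here, and the solver for the subproblem (2.47) is left abstract:

import numpy as np

def trust_region_step(E, grad_E, hess_E, W, delta, solve_subproblem,
                      delta_max=10.0, eta=1e-4):
    # One trust region iteration: approximately solve (2.47), then adapt
    # the radius Delta^(k) using the ratio rho^(k) of Eq. (2.48).
    g, B = grad_E(W), hess_E(W)
    p = solve_subproblem(g, B, delta)        # approximate minimizer of M^(k) in the ball
    predicted = -(g @ p + 0.5 * p @ B @ p)   # predicted reduction M^(k)(0) - M^(k)(p)
    actual = E(W) - E(W + p)                 # actual reduction of the error function
    rho = actual / predicted                 # Eq. (2.48)
    if rho < 0.25:                           # poor model agreement: shrink the region
        delta = 0.25 * delta
    elif rho > 0.75 and np.linalg.norm(p) >= 0.99 * delta:
        delta = min(2.0 * delta, delta_max)  # good agreement at the boundary: grow
    if rho > eta:                            # reject the step if rho is negative or tiny
        W = W + p
    return W, delta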
There exist various approaches to solving the constrained quadratic optimization subproblem (2.47). One of these techniques [58] relies on the linear conjugate gradient method in order to solve the system of linear equations

\nabla^2 \bar{E}(W^{(k)}) p = -\nabla \bar{E}(W^{(k)}),

with respect to p. This results in the following iterations: