
so-called “restarts,” i.e., to assign β^{(k)} ← 0. For example, we might reset β^{(k)} if the consecutive directions are nonorthogonal, i.e., if

    \frac{\left| p^{(k)\,T} p^{(k-1)} \right|}{\left\| p^{(k)} \right\|^2} > \varepsilon.

In the case of the Polak–Ribière method, we should also reset β^{(k)} if it becomes negative.
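The following minimal sketch illustrates both restart rules inside a Polak–Ribière conjugate gradient loop; the `grad` callable, the fixed step length standing in for a line search, and the threshold value are assumptions for illustration, not the book's prescriptions.

```python
import numpy as np

def polak_ribiere_cg(grad, w, eps=0.1, alpha=1e-2, tol=1e-6, max_iter=1000):
    """Conjugate gradient with Polak-Ribiere beta and restarts.

    grad  -- user-supplied function returning the error gradient at w
    eps   -- nonorthogonality threshold that triggers a restart
    alpha -- fixed step length (a proper line search would be used in practice)
    """
    g = grad(w)
    p = -g                                    # initial steepest-descent direction
    for k in range(max_iter):
        w = w + alpha * p
        g_new = grad(w)
        if np.linalg.norm(g_new) < tol:
            break
        beta = g_new @ (g_new - g) / (g @ g)  # Polak-Ribiere coefficient
        if beta < 0.0:                        # PR-specific restart rule
            beta = 0.0
        p_new = -g_new + beta * p
        # restart if consecutive directions are far from orthogonal
        if abs(p_new @ p) / (p_new @ p_new) > eps:
            p_new = -g_new                    # equivalent to assigning beta = 0
        g, p = g_new, p_new
    return w
```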
  The basic second-order method is Newton’s method:

    p^{(k)} = -\left[ \nabla^2 \bar{E}(W^{(k)}) \right]^{-1} \nabla \bar{E}(W^{(k)}).        (2.38)

If the Hessian ∇²Ē(W^{(k)}) is positive definite, the resulting search direction p^{(k)} is a descent direction. If the error function is convex and quadratic, Newton’s method with a unit step length α^{(k)} = 1 finds the solution in a single step. For a smooth nonlinear error function with a positive definite Hessian at the solution, the convergence is quadratic, provided the initial guess lies sufficiently close to the solution. If a Hessian turns out to have negative or zero eigenvalues, we need to modify it in order to obtain a positive definite approximation B; for example, we might add a scaled identity matrix, so we have

    B^{(k)} = \nabla^2 \bar{E}(W^{(k)}) + \mu^{(k)} I.        (2.39)

The resulting damped method may be viewed as a hybrid of the ordinary Newton method (for μ^{(k)} = 0) and gradient descent (for μ^{(k)} → ∞).
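A one-step sketch of this damped Newton iteration, assuming hypothetical `grad` and `hess` callables for the gradient and Hessian of the error function:

```python
import numpy as np

def damped_newton_step(grad, hess, w, mu=0.0):
    """One damped Newton step: solve (H + mu*I) p = -g, Eqs. (2.38)-(2.39).

    mu = 0 recovers the pure Newton direction; as mu grows, the step
    turns toward (scaled) gradient descent.
    """
    g = grad(w)
    B = hess(w) + mu * np.eye(len(w))   # regularized Hessian, Eq. (2.39)
    p = np.linalg.solve(B, -g)          # solve rather than form the inverse
    return w + p
```

Solving the linear system is preferred over forming the explicit inverse in (2.38), since it is cheaper and numerically more stable.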
  Note that the Hessian computation is computationally very expensive; hence various approximations have been proposed. If we assume that each individual error is a quadratic form,

    E^{(p)}(W) = \frac{1}{2} e^{(p)}(W)^T e^{(p)}(W),        (2.40)

then the gradient and Hessian may be expressed in terms of the error Jacobian as follows:

    \nabla E^{(p)}(W) = \frac{\partial e^{(p)}(W)}{\partial W}^T e^{(p)}(W),
    \nabla^2 E^{(p)}(W) = \frac{\partial e^{(p)}(W)}{\partial W}^T \frac{\partial e^{(p)}(W)}{\partial W} + \sum_{i=1}^{n_e} \frac{\partial^2 e_i^{(p)}(W)}{\partial W^2}\, e_i^{(p)}(W).        (2.41)

Then, the Gauss–Newton approximation to the Hessian is obtained by discarding the second-order terms, i.e.,

    \nabla^2 E^{(p)}(W) \approx B^{(p)} = \frac{\partial e^{(p)}(W)}{\partial W}^T \frac{\partial e^{(p)}(W)}{\partial W}.        (2.42)
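Assuming user-supplied `residual` and `jacobian` callables (illustrative names), the gradient and the Gauss–Newton matrix of (2.41)–(2.42) are assembled from first derivatives only:

```python
import numpy as np

def gauss_newton_terms(residual, jacobian, w):
    """Gradient and Gauss-Newton Hessian approximation for
    E(W) = 0.5 * e(W)^T e(W).

    residual -- returns the error vector e(W), shape (n_e,)
    jacobian -- returns J = de/dW, shape (n_e, n_w)
    """
    e = residual(w)
    J = jacobian(w)
    grad = J.T @ e    # exact gradient, first line of Eq. (2.41)
    B = J.T @ J       # Gauss-Newton approximation, Eq. (2.42)
    return grad, B
```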
The resulting matrix B can turn out to be degenerate, so we might modify it by adding a scaled identity matrix, as mentioned above in (2.39). Then we have

    B^{(p)} = \frac{\partial e^{(p)}(W)}{\partial W}^T \frac{\partial e^{(p)}(W)}{\partial W} + \mu^{(k)} I.        (2.43)

This technique leads us to the Levenberg–Marquardt method.
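A compact Levenberg–Marquardt loop built on (2.43); the multiplicative adaptation of μ below is one common heuristic, assumed here for illustration rather than taken from the book:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, w, mu=1e-2, tol=1e-6, max_iter=100):
    """Levenberg-Marquardt: solve (J^T J + mu*I) p = -J^T e at each step."""
    e = residual(w)
    for k in range(max_iter):
        J = jacobian(w)
        g = J.T @ e
        if np.linalg.norm(g) < tol:
            break
        B = J.T @ J + mu * np.eye(len(w))   # damped matrix, Eq. (2.43)
        p = np.linalg.solve(B, -g)
        e_new = residual(w + p)
        if e_new @ e_new < e @ e:           # error decreased: accept, relax damping
            w, e = w + p, e_new
            mu *= 0.5
        else:                               # error increased: reject, raise damping
            mu *= 2.0
    return w
```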
  A family of quasi-Newton methods estimates the inverse Hessian by accumulating the changes of gradients. These methods construct an inverse Hessian approximation H ≈ [∇²Ē(W)]^{-1} so as to satisfy the secant equation:

    H^{(k+1)} y^{(k)} = s^{(k)},
    s^{(k)} = W^{(k+1)} - W^{(k)},        (2.44)
    y^{(k)} = \nabla \bar{E}(W^{(k+1)}) - \nabla \bar{E}(W^{(k)}).

However, for n_w > 1 this system of equations is underdetermined and there exists an infinite number of solutions. Thus, additional constraints are imposed, giving rise to various quasi-Newton methods; one representative update is sketched below.
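As one concrete example, the BFGS update, a standard member of this family (chosen here as an illustration, not as the book's prescription), produces a new approximation that satisfies the secant equation (2.44) exactly:

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse Hessian approximation H.

    s = W_{k+1} - W_k and y = grad_{k+1} - grad_k as in Eq. (2.44).
    The returned matrix satisfies H_new @ y == s (the secant equation).
    """
    rho = 1.0 / (y @ s)                 # assumes the curvature condition y^T s > 0
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```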
Most of them require that the inverse Hessian approximation H^{(k+1)}