Page 69 - Neural Network Modeling and Identification of Dynamical Systems
2.2 ARTIFICIAL NEURAL NETWORK TRAINING METHODS 57
$$
\begin{aligned}
p^{(k,0)} &= 0,\\
r^{(k,0)} &= \nabla \bar{E}(W^{(k)}),\\
d^{(k,0)} &= -\nabla \bar{E}(W^{(k)}),\\
\alpha^{(k,s)} &= \frac{r^{(k,s)T} r^{(k,s)}}{d^{(k,s)T} \nabla^2 \bar{E}(W^{(k)}) d^{(k,s)}},\\
p^{(k,s+1)} &= p^{(k,s)} + \alpha^{(k,s)} d^{(k,s)},\\
r^{(k,s+1)} &= r^{(k,s)} + \alpha^{(k,s)} \nabla^2 \bar{E}(W^{(k)}) d^{(k,s)},\\
\beta^{(k,s+1)} &= \frac{r^{(k,s+1)T} r^{(k,s+1)}}{r^{(k,s)T} r^{(k,s)}},\\
d^{(k,s+1)} &= -r^{(k,s+1)} + \beta^{(k,s+1)} d^{(k,s)}.
\end{aligned}
$$

The iterations are terminated prematurely either if they cross the trust region boundary, $\|p^{(k,s+1)}\| \geq \Delta^{(k)}$, where $\Delta^{(k)}$ denotes the trust region radius, or if a nonpositive curvature direction is discovered, $d^{(k,s)T} \nabla^2 \bar{E}(W^{(k)}) d^{(k,s)} \leq 0$. In these cases, the solution is the intersection of the current search direction with the trust region boundary. It is important to note that this method does not require one to compute the entire Hessian matrix; instead, we need only the Hessian vector products of the form $\nabla^2 \bar{E}(W^{(k)}) d^{(k,s)}$, which may be computed more efficiently by reverse-mode automatic differentiation methods described below. Such Hessian-free methods have been successfully applied to neural network training [59,60].

Another approach to solving (2.47) [61,62] replaces the subproblem with an equivalent problem of finding both the vector $p \in \mathbb{R}^{n_w}$ and the scalar $\mu \geq 0$ such that

$$
\begin{aligned}
\left(\nabla^2 \bar{E}(W^{(k)}) + \mu I\right) p &= -\nabla \bar{E}(W^{(k)}),\\
\mu\left(\Delta^{(k)} - \|p\|\right) &= 0,
\end{aligned}
\tag{2.49}
$$

where $\nabla^2 \bar{E}(W^{(k)}) + \mu I$ is positive semidefinite. There are two possibilities. If $\mu = 0$, then we have $p = -\left(\nabla^2 \bar{E}(W^{(k)})\right)^{-1} \nabla \bar{E}(W^{(k)})$ and $\|p\| \leq \Delta^{(k)}$. If $\mu > 0$, then we define $p(\mu) = -\left(\nabla^2 \bar{E}(W^{(k)}) + \mu I\right)^{-1} \nabla \bar{E}(W^{(k)})$ and solve the one-dimensional equation $\|p(\mu)\| = \Delta^{(k)}$ with respect to $\mu$.

Note that since the error function (2.25) is a summation of errors for each individual training example, its gradient as well as its Hessian may also be represented as summations of gradients and Hessians of these errors, i.e.,

$$
\nabla \bar{E}(W) = \sum_{p=1}^{P} \nabla E^{(p)}(W), \tag{2.50}
$$

$$
\nabla^2 \bar{E}(W) = \sum_{p=1}^{P} \nabla^2 E^{(p)}(W). \tag{2.51}
$$

If the neural network has a large number of parameters $n_w$ and the data set contains a large number of training examples $P$, computation of the total error function value $\bar{E}$ as well as its derivatives can be time consuming. Thus, even for a simple GD method, each update of the weights takes a lot of time. In this case, we might apply a stochastic gradient descent (SGD) method, which randomly shuffles the training examples, iterates over them, and updates the parameters using the gradients of the individual errors $E^{(p)}$:

$$
\begin{aligned}
W^{(k,p)} &= W^{(k,p-1)} - \alpha^{(k)} \nabla E^{(p)}(W^{(k,p-1)}),\\
W^{(k+1,0)} &= W^{(k,P)}.
\end{aligned}
\tag{2.52}
$$

In contrast, the usual gradient descent is called the batch method. Note that although the $(k,p)$th step decreases the error for the $p$th training example, it may increase the error for the other examples. On the one hand, this allows the method to escape some local minima, but on the other hand, it makes convergence to a final solution more difficult. To circumvent this issue, we might gradually decrease the step lengths $\alpha^{(k)}$. Also, in order to achieve a "smoother" convergence, we could perform the weight updates based on random subsets of training examples, which is called a "minibatch" strategy. The stochastic or minibatch approach may also be applied to other optimization methods; see [63].
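The SGD update (2.52) and its minibatch variant can be illustrated with a small sketch. This is not the book's implementation: the model is a hypothetical scalar linear fit with per-example error E^(p)(w) = 0.5 (w0 + w1 x_p - y_p)^2, and the names `grad_example`, `sgd_epoch`, `total_error`, and `batch_size`, as well as the particular step length schedule, are assumptions introduced only for this example.

```python
import random

def grad_example(w, x, y):
    """Gradient of E_p(w) = 0.5 * (w[0] + w[1]*x - y)**2 with respect to w."""
    residual = w[0] + w[1] * x - y
    return [residual, residual * x]

def sgd_epoch(w, data, step, batch_size=1, rng=random):
    """One epoch (index k): shuffle the examples, then update per (mini)batch.

    batch_size=1 gives the plain SGD rule; batch_size=len(data) gives a single
    batch gradient descent step with the summed gradient.
    """
    order = list(range(len(data)))
    rng.shuffle(order)
    w = list(w)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # Sum of per-example gradients over the current random subset.
        g = [0.0, 0.0]
        for p in batch:
            gp = grad_example(w, *data[p])
            g = [gi + gpi for gi, gpi in zip(g, gp)]
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w

def total_error(w, data):
    """Total error: the sum of the per-example errors."""
    return sum(0.5 * (w[0] + w[1] * x - y) ** 2 for x, y in data)

# Hypothetical noiseless data generated from y = 1 + 2x (assumption).
data = [(i / 10.0, 1.0 + 2.0 * i / 10.0) for i in range(-10, 11)]
rng = random.Random(0)
w = [0.0, 0.0]
for k in range(200):
    step = 0.1 / (1.0 + 0.01 * k)   # gradually decreasing step length
    w = sgd_epoch(w, data, step, batch_size=4, rng=rng)
```

Setting `batch_size=1` recovers the per-example rule, while `batch_size=len(data)` reduces each epoch to one batch gradient descent step, so the same function covers all three regimes discussed above.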
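The truncated conjugate gradient iteration for the trust region subproblem, given earlier in this section, might be sketched as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the callable `hvp` stands in for the Hessian vector product (in practice obtained by automatic differentiation; a caller may supply any function), `radius` is the trust region radius, and the residual tolerance `tol`, the iteration cap `max_iter`, and the helper names are hypothetical additions. The direction update is written in the standard form d <- -r + beta d, which agrees with the initialization d = -r.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def axpy(alpha, x, y):
    """Return alpha * x + y for plain Python lists."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def to_boundary(p, d, radius):
    """Positive tau such that ||p + tau * d|| = radius (p lies inside)."""
    a, b, c = dot(d, d), 2.0 * dot(p, d), dot(p, p) - radius ** 2
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def steihaug_cg(grad, hvp, radius, tol=1e-10, max_iter=100):
    """Approximately minimize g^T p + 0.5 p^T H p subject to ||p|| <= radius.

    Only products H d are needed, supplied through the callable hvp(d).
    """
    p = [0.0] * len(grad)
    r = list(grad)                  # gradient of the quadratic model at p
    d = [-g for g in grad]
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) < tol:
            break
        Hd = hvp(d)
        curvature = dot(d, Hd)
        if curvature <= 0.0:        # nonpositive curvature: stop on the boundary
            return axpy(to_boundary(p, d, radius), d, p)
        alpha = dot(r, r) / curvature
        p_next = axpy(alpha, d, p)
        if math.sqrt(dot(p_next, p_next)) >= radius:   # crossed the boundary
            return axpy(to_boundary(p, d, radius), d, p)
        r_next = axpy(alpha, Hd, r)
        beta = dot(r_next, r_next) / dot(r, r)
        d = axpy(beta, d, [-ri for ri in r_next])      # d <- -r_next + beta * d
        p, r = p_next, r_next
    return p

# Hypothetical 2x2 example: H = diag(2, 4), gradient g = (-2, -8).
H = [[2.0, 0.0], [0.0, 4.0]]
hvp = lambda d: [dot(row, d) for row in H]
p_inside = steihaug_cg([-2.0, -8.0], hvp, radius=10.0)   # unconstrained minimizer
p_clipped = steihaug_cg([-2.0, -8.0], hvp, radius=1.0)   # stops on the boundary
```

With the large radius the iteration returns the Newton step of the quadratic model; with the small radius the first step already leaves the trust region, so the result is the boundary point along the steepest descent direction.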