It is due to these additional parameters a that the Jacobian of H will have full rank for all τ ∈ [0, 1). The following theorem [31] offers a theoretical justification for the probability-one homotopy methods.
Theorem 6. Let H: R^{n_w} × [0, 1) × R^{n_w} → R^{n_w} be a C²-smooth vector-valued function, and let H_a: [0, 1) × R^{n_w} → R^{n_w} be a vector-valued function which satisfies H_a(τ, w) ≡ H(a, τ, w). Assume that the zero vector 0 ∈ R^{n_w} is a regular value of H. Finally, assume that for each value of the additional parameters a ∈ R^{n_w} the equation system H_a(0, w) = 0 has a unique solution w̃. Then, for almost all a ∈ R^{n_w}, there exists a C²-smooth curve γ ⊂ [0, 1) × R^{n_w}, emanating from (0, w̃) and satisfying H_a(τ, w) = 0 for all (τ, w) ∈ γ. Also, if the curve γ is bounded, then it has an accumulation point (1, w̄) for some w̄ ∈ R^{n_w}. Moreover, if the Jacobian of H_a at the point (1, w̄) has full rank, then the curve γ has a finite arc length.
Since the zero vector is a regular value, the Jacobian of H has full rank at all points of the curve γ; therefore this curve has no self-intersections or intersections with the other solution curves of H_a. Also, since the equation system H_a(0, w) = 0 has a unique solution, the curve γ cannot return to cross the hyperplane τ = 0. The convex homotopy (5.43) satisfies all the conditions of Theorem 6 for any C²-smooth function F, with w̃ ≡ a. In order to guarantee the boundedness of γ, we may require that the equation system H_a(τ, w) = 0 does not have solutions at infinity. This can be achieved by means of regularization.
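To make the continuation procedure concrete, the sketch below tracks the solution curve of a homotopy map numerically by stepping the parameter τ and correcting w with Newton iterations at each step. It is a minimal illustration rather than the book's algorithm: it assumes the convex homotopy (5.43) has the standard form (1 − τ)(w − a) + τF(w), it steps τ monotonically instead of tracking γ by arc length, and the two-dimensional system F is an invented placeholder; the function names (track_homotopy, newton_corrector) are ours.

```python
import numpy as np

def newton_corrector(h, jac_w, tau, w, tol=1e-10, max_iter=50):
    """Solve h(tau, w) = 0 for w by Newton's method, warm-started at w."""
    for _ in range(max_iter):
        r = h(tau, w)
        if np.linalg.norm(r) < tol:
            break
        w = w - np.linalg.solve(jac_w(tau, w), r)
    return w

def track_homotopy(h, jac_w, w0, n_steps=100):
    """Naive continuation in tau: step tau from 0 to 1 and correct w at
    each step, using the previous solution as the initial guess."""
    w = np.asarray(w0, dtype=float)
    for tau in np.linspace(0.0, 1.0, n_steps + 1)[1:]:
        w = newton_corrector(h, jac_w, tau, w)
    return w

# Illustrative target system F(w) = 0 (not from the book) and the convex
# homotopy (1 - tau) * (w - a) + tau * F(w), whose tau = 0 solution is w = a.
a = np.array([0.5, -0.3])
F = lambda w: np.array([w[0]**3 + w[1] - 1.0, w[1]**3 - w[0] + 1.0])
JF = lambda w: np.array([[3.0 * w[0]**2, 1.0], [-1.0, 3.0 * w[1]**2]])
h = lambda tau, w: (1.0 - tau) * (w - a) + tau * F(w)
jac = lambda tau, w: (1.0 - tau) * np.eye(2) + tau * JF(w)

w_bar = track_homotopy(h, jac, a)   # follows the curve from (0, a) toward tau = 1
print(w_bar, F(w_bar))              # w_bar is close to the root (1, 0) of F
```

Probability-one methods track γ by arc length rather than by τ, since Theorem 6 only guarantees that the curve reaches τ = 1 as an accumulation point, not that τ increases monotonically along it; the simple loop above suffices only when the curve has no turning points.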
Although this method is designed for solving systems of nonlinear equations, it can be applied to optimization problems as well. In order to do that, we replace the error function minimization problem Ē(w) → min_w with the problem of finding a stationary point, ∂Ē(w)/∂w = 0, i.e., with a system of nonlinear equations. We should mention that these equations represent only the necessary conditions for a local minimum of the error function. These conditions are not sufficient unless the error function is pseudoconvex. Hence, the solution w∗ of this system should be additionally verified: for example, if the Hessian ∂²Ē(w∗)/∂w² has full rank and all of its eigenvalues are positive, then the solution is a local minimum. Also note that we have two possibilities: either we transform the optimization problem into a system of equations and then construct a homotopy for this system, or we construct a homotopy for the error function and then differentiate it to obtain a homotopy for the equation system.
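The verification step can be written down directly: compute the Hessian at the candidate point and check that all of its eigenvalues are positive. The sketch below does this for a deliberately simple, made-up error function with two stationary points, one of which satisfies only the necessary condition; the error function and all names here are illustrative, not taken from the book.

```python
import numpy as np

def is_strict_local_minimum(hess, w_star, tol=1e-8):
    """Sufficient condition from the text: the Hessian of the error
    function at the stationary point w* has full rank and all of its
    eigenvalues are positive (i.e., it is positive definite)."""
    H = np.asarray(hess(w_star), dtype=float)
    eigvals = np.linalg.eigvalsh(0.5 * (H + H.T))   # symmetrize for safety
    return bool(np.all(eigvals > tol))

# Illustrative error function (not from the book) with two stationary points.
E      = lambda w: 0.5 * w[0]**2 + 0.1 * w[0] * w[1] + w[1]**4
grad_E = lambda w: np.array([w[0] + 0.1 * w[1], 0.1 * w[0] + 4.0 * w[1]**3])
hess_E = lambda w: np.array([[1.0, 0.1], [0.1, 12.0 * w[1]**2]])

w_saddle = np.array([0.0, 0.0])        # grad_E vanishes, but this is a saddle
w_minim  = np.array([-0.005, 0.05])    # grad_E vanishes here as well

print(is_strict_local_minimum(hess_E, w_saddle))   # False: necessary, not sufficient
print(is_strict_local_minimum(hess_E, w_minim))    # True: a genuine local minimum
```

A candidate w∗ produced by the continuation procedure would be passed through such a check before being accepted as a minimizer.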
The homotopy continuation approach has previously been applied to the feedforward neural network training problem. Some authors [32,33] have applied the convex homotopy (5.43) as well as the so-called "global" homotopy of the form

H(a, τ, w) = F(w) − (1 − τ)F(a)    (5.44)

to the sum of the squared errors objective function. Gorse, Shepherd, and Taylor [34] have suggested a homotopy that scales the training set target outputs from their mean value at τ = 0 to their original values at τ = 1. Coetzee [35] has proposed a "natural" homotopy that transforms the neuron activation functions from linear to nonlinear ones (ϕ(τ, n) = (1 − τ)n + τ th n), thereby deforming the problem from linear to nonlinear regression. Coetzee has also suggested the use of regularization in order to keep the solution curve γ bounded. The authors of [32,35] have also studied homotopy continuation methods which allow for a search for multiple solutions to the problem. However, in this book we are only concerned with a single-solution search.
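For concreteness, the deformations mentioned in this paragraph can be written as plain functions. The sketch below is only a transcription of the quoted formulas under hypothetical names of our choosing; in particular, F is assumed here to be the gradient of the sum-of-squared-errors objective (following the stationary-point formulation above), and the book's "th" is written as tanh.

```python
import numpy as np

def global_homotopy(F, a):
    """'Global' homotopy (5.44): H(a, tau, w) = F(w) - (1 - tau) * F(a).
    F is assumed to be the gradient of the sum-of-squared-errors objective,
    so that F(w) = 0 is the stationarity condition."""
    Fa = F(a)
    return lambda tau, w: F(w) - (1.0 - tau) * Fa

def natural_activation(tau, n):
    """Coetzee-style 'natural' homotopy on the neuron activation:
    linear at tau = 0, hyperbolic tangent (th) at tau = 1."""
    return (1.0 - tau) * n + tau * np.tanh(n)

def scaled_targets(tau, targets):
    """Gorse-Shepherd-Taylor-style homotopy on the training set targets:
    their mean value at tau = 0, the original values at tau = 1."""
    targets = np.asarray(targets, dtype=float)
    mean = np.mean(targets, axis=0, keepdims=True)
    return (1.0 - tau) * mean + tau * targets
```

Each of these maps yields an easy problem at τ = 0 (a known solution w = a, a linear regression, or constant targets, respectively) and recovers the original training problem at τ = 1.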
However, these homotopies are less efficient for the recurrent neural network training problem, because the sensitivity of the individual trajectory error function (5.8) to the parameters w grows exponentially over time. Thus, even for a moderate prediction horizon t̄, the error function landscape becomes quite complicated. For instance, if we utilize the convex homotopy (5.43) and fail to choose a good initial guess w^(0) for the parameters, then the error function growth will be very rapid