smooth. Hence, the minimization can be carried out using any of the optimization methods described in Section 2.2.1. However, in order to apply those methods, we need an efficient algorithm to compute the gradient and Hessian of the error function with respect to the parameters. As mentioned above, the total error gradient $\nabla \bar{E}$ and Hessian $\nabla^{2} \bar{E}$ may be expressed in terms of the individual error gradients $\nabla E^{(p)}$ and Hessians $\nabla^{2} E^{(p)}$. Thus, all that remains is to compute the derivatives of $E^{(p)}$. For notational convenience, in the remainder of this section we omit the training example index $p$.
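The decomposition just mentioned is straightforward: if, for example, the total error is the sum of the individual errors, $\bar{E} = \sum_{p} E^{(p)}$ (a constant scaling factor, if present, changes nothing essential), then

$$
\nabla \bar{E} = \sum_{p} \nabla E^{(p)}, \qquad
\nabla^{2} \bar{E} = \sum_{p} \nabla^{2} E^{(p)}.
$$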
There exist several approaches to the computation of error function derivatives:

• numeric differentiation;
• symbolic differentiation;
• automatic (or algorithmic) differentiation.
The numeric differentiation approach relies on the definition of the derivative and approximates it via finite differences. This method is very simple to implement, but it suffers from truncation and roundoff errors. It is especially inaccurate for higher-order derivatives. It also requires many function evaluations: for example, in order to estimate the error function gradient with respect to $n_w$ parameters using the simplest forward difference scheme, we require error function values at $n_w + 1$ points.
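As an illustration, a minimal forward-difference gradient estimate could be sketched as follows; the function and variable names, the step size, and the quadratic test error are ours and purely illustrative.

```python
import numpy as np

def fd_gradient(error_fn, w, eps=1e-6):
    """Forward-difference estimate of the gradient of error_fn at w.

    Uses n_w + 1 error evaluations: one at the base point and one per
    perturbed parameter; accuracy is limited by truncation and roundoff.
    """
    e0 = error_fn(w)
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_step = w.copy()
        w_step[i] += eps
        grad[i] = (error_fn(w_step) - e0) / eps
    return grad

# Toy check: for E(w) = 0.5 * ||w||^2 the exact gradient is w itself.
w = np.array([1.0, -2.0, 0.5])
print(fd_gradient(lambda v: 0.5 * np.dot(v, v), w))
```

A central-difference scheme would reduce the truncation error at the cost of roughly doubling the number of evaluations.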
Symbolic differentiation transforms a symbolic expression for the original function (usually represented in the form of a computational graph) into symbolic expressions for its derivatives by applying the chain rule. The resulting expressions may be evaluated at any point accurately to working precision. However, these expressions usually end up containing many identical subexpressions, which leads to duplicate computations (especially when we need the derivatives with respect to multiple parameters). In order to avoid this, we need to simplify the expressions for the derivatives, which presents a nontrivial problem.
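The duplicated-subexpression effect is easy to see in a toy case. The following sketch (using SymPy, our choice of tool rather than anything prescribed here) differentiates a single tanh expression with respect to two parameters and then recovers the shared factor by common subexpression elimination.

```python
import sympy as sp

w1, w2 = sp.symbols("w1 w2")
y = sp.tanh(w1 * w2 + w1)        # toy "network output"

# Both partial derivatives contain the same tanh(w1*w2 + w1) subexpression.
dy_dw1 = sp.diff(y, w1)
dy_dw2 = sp.diff(y, w2)
print(dy_dw1, dy_dw2, sep="\n")

# Common subexpression elimination pulls the shared pieces out explicitly;
# this is the kind of extra simplification work mentioned above.
replacements, reduced = sp.cse([dy_dw1, dy_dw2])
print(replacements)
print(reduced)
```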
The automatic differentiation technique [64] computes function derivatives at a point by applying the chain rule to the corresponding numerical values instead of symbolic expressions. This method produces accurate derivative values, just like symbolic differentiation, and also allows for a certain performance optimization. Note that automatic differentiation relies on the original computational graph for the function to be differentiated. Thus, if the original graph makes use of some common intermediate values, they will be efficiently reused by the differentiation procedure. Automatic differentiation is especially useful for neural network training, since it scales well to multiple parameters as well as to higher-order derivatives. In this book, we adopt the automatic differentiation approach.
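A minimal sketch of this idea propagates derivative values alongside function values through the same computational graph (this is the forward mode discussed next); the Dual class and the example expression are ours and purely illustrative.

```python
import math

class Dual:
    """A value paired with its derivative; arithmetic applies the chain rule
    to numerical values rather than to symbolic expressions."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def tanh(x):
    t = math.tanh(x.val)                   # intermediate value computed once...
    return Dual(t, (1.0 - t * t) * x.dot)  # ...and reused for the derivative

# d/dw1 of tanh(w1*w2 + w1) at (w1, w2) = (0.3, -1.2):
w1 = Dual(0.3, 1.0)    # seed: differentiate with respect to w1
w2 = Dual(-1.2, 0.0)
y = tanh(w1 * w2 + w1)
print(y.val, y.dot)    # function value and exact dy/dw1 in one pass
```

One such pass yields the sensitivity with respect to a single input; reverse mode, described below, instead yields the sensitivities of a single output with respect to all inputs in one backward pass.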
Automatic differentiation encompasses two different modes of computation: forward and reverse. Forward mode computes sensitivities of all variables with respect to the input variables: it starts with the intermediate variables that explicitly depend on the input variables (the most deeply nested subexpressions) and proceeds “forward” by applying the chain rule, until the output variables are processed. Reverse mode computes sensitivities of the output variables with respect to all variables: it starts with the intermediate variables on which the output variables explicitly depend (the outermost subexpressions) and proceeds “in reverse” by applying the chain rule, until the input variables are processed. Each mode has its own advantages and disadvantages. The forward mode allows one to compute function values as well as derivatives of multiple orders in a single pass. On the other hand, in order to compute the $r$th-order derivative using the reverse mode, one needs the derivatives of all the lower orders $s = 0, \ldots, r - 1$ beforehand. The computational complexity of computing first-order derivatives in the forward mode is proportional to the number of inputs, while in the reverse mode it is proportional to the number of outputs. In our case, there is only one out-