relations between E and the data probability distributions, it is also possible to
conclude that (see Bishop, 1995):
1. With n → ∞, the error E converges to a minimum corresponding to:

   $$y_k(\mathbf{x}) = z_k(\mathbf{x}) = \int t_k \, p(t_k \mid \mathbf{x}) \, dt_k \,, \tag{5-33}$$

   where z_k(x) is the regression solution for the training set T_k of target values for the class ω_k, and y_k(x) denotes the corresponding network output. The integral (5-33) is also known as the conditional average of the target data and is denoted E[t_k | x] (a derivation sketch is given after this list).
2. The minimization of E corresponds to the hidden neurons transforming the input
   data in a way similar to the Fisher discriminant described in section 4.1.4.
   Therefore, the outputs are obtained as linear combinations in a reduced-dimensionality
   space produced by a Fisher discriminant-like transformation.
3. For sigmoid activation functions and Gaussian distributions of the patterns, the
   multi-layer perceptron outputs are the class posterior probabilities (a
   generalization of (5-12b)). In this situation, therefore, the neural net solution
   is equivalent to the statistical classification solution. For other types of
   distributions the posterior probabilities can also be easily obtained.
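
The following derivation sketch (not part of the original text) indicates why the conditional-average result in item 1 holds; it follows the decomposition given in Bishop (1995), with y_k(x) denoting the network output for class ω_k and E normalized by the number of patterns:

$$
E \;\xrightarrow[\;n \to \infty\;]{}\;
\frac{1}{2}\sum_k \int \bigl\{ y_k(\mathbf{x}) - \mathrm{E}[t_k \mid \mathbf{x}] \bigr\}^2 \, p(\mathbf{x})\, d\mathbf{x}
\;+\;
\frac{1}{2}\sum_k \int \bigl\{ \mathrm{E}[t_k^2 \mid \mathbf{x}] - \mathrm{E}[t_k \mid \mathbf{x}]^2 \bigr\} \, p(\mathbf{x})\, d\mathbf{x}.
$$

The second term is the average conditional variance of the targets and does not depend on the network weights; hence E is minimized when y_k(x) = E[t_k | x] = z_k(x), which is precisely (5-33).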
5.6.2 The Hessian Matrix
The learning capabilities of MLPs, using the minimization of a squared error
measure E, depend in many ways on the second-order derivatives of E with respect
to the weights. These second-order derivatives are the elements of the Hessian
matrix H of the error:
$$H = [h_{ij}] \quad \text{with} \quad h_{ij} = \frac{\partial^2 E}{\partial w_i \, \partial w_j}\,,$$
where w_i and w_j are any weights of the network.
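
As a purely illustrative sketch (not from the text), the snippet below computes this Hessian numerically for a small one-hidden-layer network with tanh hidden units, using central finite differences on a flattened weight vector; the network sizes, data and step size are arbitrary assumptions:

# Sketch: numerical Hessian of the squared error E of a tiny one-hidden-layer
# MLP, obtained by central finite differences on the flattened weight vector.
# The network sizes, data and step size h are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))          # 20 patterns, 2 inputs
t = rng.normal(size=(20, 1))          # target values

n_in, n_hid, n_out = 2, 3, 1
shapes = [(n_in, n_hid), (n_hid, n_out)]        # W1, W2 (biases omitted)
n_w = sum(r * c for r, c in shapes)

def unpack(w):
    W1 = w[:n_in * n_hid].reshape(n_in, n_hid)
    W2 = w[n_in * n_hid:].reshape(n_hid, n_out)
    return W1, W2

def error(w):
    W1, W2 = unpack(w)
    y = np.tanh(X @ W1) @ W2          # network outputs y_k(x)
    return 0.5 * np.sum((y - t) ** 2) # squared-error measure E

def hessian(w, h=1e-4):
    # central-difference approximation of the second derivatives of E
    H = np.zeros((n_w, n_w))
    for i in range(n_w):
        for j in range(n_w):
            wpp = w.copy(); wpp[i] += h; wpp[j] += h
            wpm = w.copy(); wpm[i] += h; wpm[j] -= h
            wmp = w.copy(); wmp[i] -= h; wmp[j] += h
            wmm = w.copy(); wmm[i] -= h; wmm[j] -= h
            H[i, j] = (error(wpp) - error(wpm)
                       - error(wmp) + error(wmm)) / (4 * h * h)
    return H

w0 = rng.normal(scale=0.5, size=n_w)
H = hessian(w0)
print("symmetric:", np.allclose(H, H.T, atol=1e-6))
print("eigenvalues:", np.round(np.linalg.eigvalsh(H), 4))

The symmetry check and the eigenvalue printout anticipate the eigen-analysis of the Hessian discussed next.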
The Hessian is a symmetric matrix which, in the neighbourhood of a minimum of E,
is positive semi-definite. It plays an important role in several optimisation
approaches to MLP training, as well as in the convergence process towards a
minimum error solution. In order to examine this last aspect, let us assume that
we have computed the eigenvectors u_i and eigenvalues λ_i of the Hessian, in a
similar way to what we have done for the covariance matrix in section 2.3: