

                                relations between E and  the data probability distributions, it  is  also  possible to
                                conclude that (see Bishop, 1995):

1. With n → ∞, the error E converges to a minimum corresponding to:

   yk(x) = zk(x) = ∫ tk p(tk | x) dtk ,                                    (5-33)

   where zk(x) is the regression solution for the training set Tk of target values for
   the class ωk. The integral (5-33) is also known as the conditional average of the
   target data and is denoted E[tk | x].

2. The minimization of E corresponds to the hidden neurons transforming the input
   data in a way similar to the Fisher discriminant described in section 4.1.4.
   Therefore, the outputs are obtained as linear combinations in a reduced-dimensionality
   space produced by a Fisher discriminant-like transformation.
3. For sigmoid activation functions and Gaussian distributions of the patterns, the
   multi-layer perceptron outputs are the class posterior probabilities (a
   generalization of (5-12b)). Therefore, in this situation the neural net solution is
   equivalent to the statistical classification solution. For other types of
   distributions the posterior probabilities can also be obtained; a numerical sketch
   of this property is given below.
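To make points 1 and 3 concrete, the following is a minimal numerical sketch (not from the text): for two univariate Gaussian classes with equal variance and equal priors, the Bayes posterior P(ω1 | x) is a sigmoid of a linear function of x, and a single sigmoid output unit trained to minimize the squared error approaches the conditional average E[t | x] = P(ω1 | x). The sample size, learning rate and number of iterations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Two univariate Gaussian classes, equal variance and equal priors; targets t = 0 / 1
x0 = rng.normal(-1.0, 1.0, n)               # class omega_0
x1 = rng.normal(+1.0, 1.0, n)               # class omega_1
x = np.concatenate([x0, x1])
t = np.concatenate([np.zeros(n), np.ones(n)])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Single sigmoid unit y(x) = sigmoid(w*x + b), trained by gradient descent
# on the squared error E = 1/2 * sum_i (y(x_i) - t_i)^2
w, b, lr = 0.0, 0.0, 0.5
for _ in range(5000):
    y = sigmoid(w * x + b)
    delta = (y - t) * y * (1.0 - y)          # dE/da for the activation a = w*x + b
    w -= lr * np.mean(delta * x)
    b -= lr * np.mean(delta)

# For this problem the Bayes posterior is exactly P(omega_1 | x) = sigmoid(2x);
# the trained output should approximate it, and hence E[t | x].
print("  x    net y(x)   Bayes P(w1|x)")
for xv in np.linspace(-3.0, 3.0, 7):
    print(f"{xv:5.1f}   {sigmoid(w * xv + b):8.3f}   {sigmoid(2.0 * xv):10.3f}")
```

With enough data and iterations the two columns agree closely, illustrating, for this Gaussian case, the equivalence between the squared-error neural net solution and the statistical (Bayes) solution.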



                                 5.6.2 The Hessian Matrix

The learning capabilities of MLPs, trained by minimization of a squared error
measure E, depend in many ways on the second-order derivatives of E with respect
to the weights. These second-order derivatives are the elements of the Hessian
matrix H of the error:
   H = [hij],   with   hij = ∂²E / (∂wi ∂wj),

where wi and wj are any weights of the network.
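As an illustration of this definition (a sketch under assumed network sizes and data, not a method from the text), the Hessian of the squared error of a tiny one-hidden-layer perceptron can be estimated by central finite differences over a flattened weight vector; the symmetry check and the eigenvalues printed at the end connect to the λi discussed next.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                    # 20 patterns, 2 inputs (assumed)
t = (X[:, 0] + X[:, 1] > 0).astype(float)       # binary targets

n_hidden = 3
n_w = 2 * n_hidden + n_hidden                   # input->hidden plus hidden->output weights
w0 = rng.normal(scale=0.5, size=n_w)

def error(w):
    """Squared error E = 1/2 * sum (y - t)^2 of the tiny MLP for weight vector w."""
    W1 = w[:2 * n_hidden].reshape(2, n_hidden)
    w2 = w[2 * n_hidden:]
    z = np.tanh(X @ W1)                         # hidden-layer activations
    y = 1.0 / (1.0 + np.exp(-(z @ w2)))         # sigmoid output
    return 0.5 * np.sum((y - t) ** 2)

def hessian(w, h=1e-4):
    """Central-difference estimate of hij = d2E / (dwi dwj)."""
    H = np.zeros((w.size, w.size))
    for i in range(w.size):
        for j in range(w.size):
            wa = w.copy(); wa[i] += h; wa[j] += h
            wb = w.copy(); wb[i] += h; wb[j] -= h
            wc = w.copy(); wc[i] -= h; wc[j] += h
            wd = w.copy(); wd[i] -= h; wd[j] -= h
            H[i, j] = (error(wa) - error(wb) - error(wc) + error(wd)) / (4.0 * h * h)
    return H

H = hessian(w0)
print("Symmetric:", np.allclose(H, H.T))
print("Eigenvalues:", np.round(np.linalg.eigvalsh(H), 4))
```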
                                   The  Hessian  is  a  symmetric  positive  semi-definite  matrix,  and  plays  an
                                 important role in several optimisation approaches to MLP training, as well as in the
                                 convergence  process  towards a minimum error solution. In order to ascertain this
last aspect, let us assume that we have computed the eigenvectors ui and eigenvalues
λi of the Hessian, as we did for the covariance matrix in section 2.3: