

For a hidden neuron the error term δ_j is more difficult to obtain, since it depends
on the errors at the output neurons it is connected to, as exemplified in Figure 5.24.
For this purpose we express δ_j as a summation of chained derivatives:

    δ_j = −∂E/∂net_j = Σ_k (−∂E/∂net_k)·(∂net_k/∂net_j) = Σ_k δ_k · w_kj f′(net_j)

Note that the first term in the summation corresponds to the back-propagated
error from an output, δ_k, and the second term reflects the influence of the
activation function of the hidden neurons, as well as the weights connecting the
hidden neuron to the output neurons. Assuming that all activation functions are
equal, we can therefore write:

    δ_j = f′(net_j) Σ_k w_kj δ_k                                        (5-23c)

Notice how the error terms at the output neurons contribute to the error terms at
the hidden neurons. This back-propagation of the errors justifies the name of the
algorithm.
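
As an illustration (not taken from the text), the computation of the error terms for
a network with one hidden layer might be sketched as follows, assuming sigmoid
activation functions, squared-error output terms δ_k = (t_k − y_k) f′(net_k), and a
weight matrix W_out whose entry [k, j] is the weight w_kj from hidden neuron j to
output neuron k (these names are illustrative, not the book's notation):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def error_terms(net_h, net_o, y_o, t, W_out):
    """Back-propagate the output errors to the hidden layer (cf. formula 5-23c)."""
    # Output-layer error terms: delta_k = (t_k - y_k) * f'(net_k)
    delta_o = (t - y_o) * sigmoid_deriv(net_o)
    # Hidden-layer error terms: delta_j = f'(net_j) * sum_k w_kj * delta_k
    delta_h = sigmoid_deriv(net_h) * (W_out.T @ delta_o)
    return delta_o, delta_h
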
Using these errors and the gradient descent equations (5-7) we can now write
the formulas for updating the weights.

- Weight connecting output neuron k with hidden neuron j:

    Δw_kj = η δ_k y_j                                                   (5-24a)

- Weight connecting hidden neuron j with input neuron i:

    Δw_ji = η δ_j x_i                                                   (5-24b)

For more than two layers, the process of error back-propagation generalizes
easily using the back-propagation formula (5-23c).
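
For the two-layer case, a minimal sketch of applying both update rules in one
gradient step might look as follows (continuing the illustrative names of the
previous sketch, with x the input vector, y_h the hidden-layer outputs and eta the
learning factor η; these names are assumptions, not the book's notation):

import numpy as np

def update_weights(W_out, W_hid, delta_o, delta_h, y_h, x, eta=0.1):
    # (5-24a): delta_w_kj = eta * delta_k * y_j
    W_out += eta * np.outer(delta_o, y_h)
    # (5-24b): delta_w_ji = eta * delta_j * x_i
    W_hid += eta * np.outer(delta_h, x)
    return W_out, W_hid
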
The back-propagation algorithm uses formulas (5-24a) and (5-24b), starting from
initial random weights, until the iterative gradient descent process reaches a
minimum of the energy function. The error hypersurface of a multi-layer perceptron
depends on several weight parameters, and is therefore expected to be quite complex
and possibly to have many local minima. Notice that even such a simple problem as
the one presented in Figure 5.4 already exhibited local minima. Usually many trials
have to be performed, with different initial weights and learning factors η, in order
to reach the global minimum. Also, for large learning factors one may obtain
divergent behaviour or wild oscillations around the minimum, as previously mentioned
for the ECG filter example in section 5.1. As a remedy for this oscillating behaviour
it is usual to include a momentum term in the weight-updating formulas, dependent
upon the weight increment of the previous iteration, as follows:
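
A minimal sketch of such a momentum term, assuming the common form in which a
fraction alpha (the momentum factor) of the previous iteration's weight increment is
added to the current gradient-descent step; alpha and the variable names below are
illustrative, not necessarily the book's notation:

def momentum_step(W, grad_increment, prev_increment, alpha=0.9):
    # Current weight increment: the plain gradient-descent term
    # (e.g. eta * delta_k * y_j from (5-24a)) plus alpha times the
    # increment applied in the previous iteration.
    increment = grad_increment + alpha * prev_increment
    return W + increment, increment

At each iteration the returned increment is stored and passed back as
prev_increment in the next call; the accumulated term smooths the weight trajectory
and damps the oscillations mentioned above.
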