any real practical relevance. The last two are called sigmoidal functions because of their S-shaped appearance. The parameter a governs the sigmoidal slope.

Discriminant units having these activation functions all have outputs in a well-defined range, [0, 1] or [-1, 1], which is quite convenient for classification purposes. The step function (also called hard-limiter or threshold function) is not differentiable in the whole domain. The other activation functions have the advantage of being differentiable in the whole domain, with easy derivatives.
Derivative of the logistic sigmoid (with a = 1):

$$\operatorname{sig}'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \operatorname{sig}(x)\,(1 - \operatorname{sig}(x)). \tag{5-11a}$$

Derivative of the hyperbolic tangent (with a = 1):

$$\tanh'(x) = \frac{4}{(e^{x} + e^{-x})^2} = (1 - \tanh(x))\,(1 + \tanh(x)). \tag{5-11b}$$
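As a brief illustration (a sketch, not from the original text), the following Python fragment checks identities (5-11a) and (5-11b) against a central finite-difference approximation; the function names `sig`, `sig_prime` and `tanh_prime` are chosen here to match the notation above:

```python
import numpy as np

def sig(x):
    """Logistic sigmoid with slope parameter a = 1."""
    return 1.0 / (1.0 + np.exp(-x))

def sig_prime(x):
    """Closed-form derivative from (5-11a): sig(x) * (1 - sig(x))."""
    s = sig(x)
    return s * (1.0 - s)

def tanh_prime(x):
    """Closed-form derivative from (5-11b): (1 - tanh(x)) * (1 + tanh(x))."""
    t = np.tanh(x)
    return (1.0 - t) * (1.0 + t)

# Compare the closed-form derivatives with central finite differences.
x = np.linspace(-4.0, 4.0, 9)
h = 1e-6
fd_sig = (sig(x + h) - sig(x - h)) / (2.0 * h)
fd_tanh = (np.tanh(x + h) - np.tanh(x - h)) / (2.0 * h)
print(np.allclose(sig_prime(x), fd_sig))    # True
print(np.allclose(tanh_prime(x), fd_tanh))  # True
```

The simple product forms of both derivatives are what make gradient computations with these activation functions so cheap in practice.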


The sigmoidal functions also have the useful property that, in addition to the limiting behaviour of the step function, they behave linearly near the zero crossing. They are therefore quite versatile when approximating target values.
Compared with the logistic sigmoid, the hyperbolic tangent has the advantage of usually affording faster convergence. The logistic sigmoid has, however, the relevant advantage that the outputs of networks using this activation function can be interpreted as posterior probabilities. As a matter of fact, for a two-class Bayesian classifier we have:

$$P(\omega_1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_1)\,P(\omega_1)}{p(\mathbf{x} \mid \omega_1)\,P(\omega_1) + p(\mathbf{x} \mid \omega_2)\,P(\omega_2)}.$$
We can express this formula in terms of the logistic sigmoid (with a = 1):

$$P(\omega_1 \mid \mathbf{x}) = \operatorname{sig}(t) \quad \text{with} \quad t = \ln \frac{p(\mathbf{x} \mid \omega_1)\,P(\omega_1)}{p(\mathbf{x} \mid \omega_2)\,P(\omega_2)}.$$


Assuming normal distributions with equal covariance for the likelihoods, one readily obtains t = d(x), where d(x) is the linear discriminant depending on the linear decision functions (4.23b) presented in chapter 4. The linear network therefore mimics the Bayesian classifier.
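As a numerical check of this claim (a sketch, not from the book; the means, covariance and priors below are assumed purely for illustration), the exact two-class posterior computed from Bayes' formula coincides with sig(d(x)) when both likelihoods are Gaussian with the same covariance:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class setup: equal covariance, different means.
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.5])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
p1, p2 = 0.4, 0.6  # prior probabilities P(w1), P(w2)

cov_inv = np.linalg.inv(cov)
# Linear discriminant d(x) = w^T x + w0: with equal covariances the
# quadratic terms of the log-odds t cancel, leaving a linear function.
w = cov_inv @ (mu1 - mu2)
w0 = -0.5 * (mu1 @ cov_inv @ mu1 - mu2 @ cov_inv @ mu2) + np.log(p1 / p2)

def sig(t):
    return 1.0 / (1.0 + np.exp(-t))

x = np.array([0.5, -0.2])  # an arbitrary test point
# Exact posterior from Bayes' formula ...
num = multivariate_normal.pdf(x, mu1, cov) * p1
den = num + multivariate_normal.pdf(x, mu2, cov) * p2
posterior = num / den
# ... equals the sigmoid of the linear discriminant.
print(posterior, sig(w @ x + w0))  # the two values agree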