any real practical relevance. The last two are called sigmoidal functions because of
their S-shaped appearance. The parameter a governs the sigmoidal slope.
Discriminant units having these activation functions all have outputs in a well-
defined range [0, 1] or [-1, 1], which is quite convenient for classification
purposes. The step function (also called hard-limiter or threshold function) is not
differentiable in the whole domain. The other activation functions have the
advantage of being differentiable in the whole domain with easy derivatives.
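As a point of reference, the following is a minimal sketch of the three activation functions just discussed (hard-limiter, logistic sigmoid and hyperbolic tangent, each with slope parameter a); the function names and the use of NumPy are illustrative choices, not part of the original text:

```python
import numpy as np

def step(x):
    # Hard-limiter (threshold) function: output in {0, 1}, not differentiable at x = 0.
    return np.where(x >= 0, 1.0, 0.0)

def sig(x, a=1.0):
    # Logistic sigmoid with slope parameter a: output in the range (0, 1).
    return 1.0 / (1.0 + np.exp(-a * x))

def tanh_act(x, a=1.0):
    # Hyperbolic tangent with slope parameter a: output in the range (-1, 1).
    return np.tanh(a * x)
```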
Derivative of the logistic sigmoid (with a = 1):

\[
\mathrm{sig}'(x) = \frac{e^{-x}}{(1+e^{-x})^{2}} = \mathrm{sig}(x)\,(1-\mathrm{sig}(x)). \tag{5-11a}
\]
Derivative of the hyperbolic tangent (with a = 1):

\[
\tanh'(x) = \frac{4}{(e^{x}+e^{-x})^{2}} = (1-\tanh(x))\,(1+\tanh(x)). \tag{5-11b}
\]
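As an illustrative check (not from the original text), formulas (5-11a) and (5-11b) can be verified numerically against central finite differences:

```python
import numpy as np

def sig(x, a=1.0):
    # Logistic sigmoid with slope parameter a.
    return 1.0 / (1.0 + np.exp(-a * x))

x = np.linspace(-4.0, 4.0, 9)
h = 1e-6

# (5-11a): sig'(x) = e^(-x) / (1 + e^(-x))^2 = sig(x) * (1 - sig(x))
numeric = (sig(x + h) - sig(x - h)) / (2 * h)
analytic = sig(x) * (1.0 - sig(x))
print(np.max(np.abs(numeric - analytic)))        # tiny finite-difference error only

# (5-11b): tanh'(x) = 4 / (e^x + e^(-x))^2 = (1 - tanh(x)) * (1 + tanh(x))
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)
analytic = (1.0 - np.tanh(x)) * (1.0 + np.tanh(x))
print(np.max(np.abs(numeric - analytic)))        # tiny finite-difference error only
```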
The sigmoidal functions also have the useful property that, in addition to the limiting behaviour of the step function, they exhibit approximately linear behaviour near the zero crossing. They are therefore quite versatile when approximating target values.
Compared with the logistic sigmoid, the hyperbolic tangent has the advantage of usually affording faster convergence. The logistic sigmoid has, however, the important advantage that the outputs of networks using this activation function can be interpreted as posterior probabilities. As a matter of fact, for a two-class Bayesian classifier, we have:

\[
P(\omega_1 \mid x) = \frac{p(x \mid \omega_1)\,P(\omega_1)}{p(x \mid \omega_1)\,P(\omega_1) + p(x \mid \omega_2)\,P(\omega_2)}.
\]

We can express this formula in terms of the logistic sigmoid (with a = 1):

\[
P(\omega_1 \mid x) = \mathrm{sig}(t) \quad \text{with} \quad t = \ln\frac{p(x \mid \omega_1)\,P(\omega_1)}{p(x \mid \omega_2)\,P(\omega_2)}.
\]
Assuming normal distributions with equal covariance for the likelihoods, one readily obtains t = d(x), where d(x) is the linear discriminant corresponding to the linear decision functions (4.23b) presented in chapter 4. The linear network is therefore mimicking the Bayesian classifier.
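The following sketch illustrates this equivalence numerically for two Gaussian classes sharing a covariance matrix; the particular means, covariance and priors are arbitrary example values, and SciPy is assumed only for evaluating the Gaussian likelihoods:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = np.array([1.0, 1.0]), np.array([-1.0, 0.5])   # class means (example values)
cov = np.array([[1.0, 0.3], [0.3, 0.8]])                 # shared covariance matrix
P1, P2 = 0.6, 0.4                                         # class priors

# Linear discriminant d(x) = w.x + w0 obtained from the log-likelihood ratio
# (a linear decision function of the kind referred to as (4.23b) in chapter 4).
cov_inv = np.linalg.inv(cov)
w = cov_inv @ (mu1 - mu2)
w0 = -0.5 * (mu1 @ cov_inv @ mu1 - mu2 @ cov_inv @ mu2) + np.log(P1 / P2)

def sig(t):
    # Logistic sigmoid (a = 1).
    return 1.0 / (1.0 + np.exp(-t))

x = np.array([0.2, -0.7])                                 # an arbitrary test point

# Posterior computed directly from Bayes' formula...
p1 = multivariate_normal.pdf(x, mu1, cov)
p2 = multivariate_normal.pdf(x, mu2, cov)
posterior_bayes = p1 * P1 / (p1 * P1 + p2 * P2)

# ...and via the logistic sigmoid of the linear discriminant: the two values coincide.
posterior_sig = sig(w @ x + w0)
print(posterior_bayes, posterior_sig)
```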