Because of the saturation effect of the most popular activation functions, namely
the sigmoidal functions, it is advisable to scale all inputs to a convenient range;
otherwise, the inputs with higher values will drive the training process, masking the
contribution of lower-valued inputs. The choice of appropriate initial weights also
depends on the interval range of the inputs: if the weights are set far from that
range, the convergence time can increase significantly.
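As a brief illustration of this point (not a prescription from the text), one common practice is to draw the initial weights at random from a small symmetric interval once the inputs have been scaled; the interval [-0.5, 0.5] and the extra bias row below are merely example choices.

import numpy as np

def init_weights(n_inputs, n_neurons, limit=0.5, rng=None):
    # Small random weights, uniform in [-limit, +limit]; a sensible choice
    # when the inputs themselves have been scaled to [0, 1] or [-1, 1].
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-limit, limit, size=(n_inputs + 1, n_neurons))  # extra row for the bias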
In general, the pre-processing of the inputs consists of performing a linear
feature scaling in such a way that they all occupy the [0, 1] or [-1, 1] interval.
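The following sketch performs this linear (min-max) feature scaling with NumPy; the function name and the default [-1, 1] target interval are illustrative choices, not prescribed by the text.

import numpy as np

def scale_features(X, interval=(-1.0, 1.0)):
    # Linearly rescale each column (feature) of X to the given interval.
    lo, hi = interval
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard against constant features
    return lo + (hi - lo) * (X - x_min) / span

For example, scale_features(X, (0.0, 1.0)) maps every feature of X to [0, 1].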
Post-processing operations are related to the application of neural nets for
classification purposes. In this situation it is usually convenient to code the outputs
as nominal variables, whose values are class labels.
In two-class classification problems one may use a single output, coded as a
two-state variable (e.g. -1, +1). When there are more than two classes it is more
appropriate to have one output for each class, with one nominal value representing
the class decision. For instance, for three classes one would have three outputs with
nominal values (+1, -1) corresponding to the class decisions ω1 = (+1, -1, -1),
ω2 = (-1, +1, -1) and ω3 = (-1, -1, +1). Network output values can be converted to the
proper nominal value using thresholds. By setting up appropriate threshold values
one can also define reject regions, as done in section 4.2.3. Output thresholding is
easily performed using step functions.
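A minimal sketch of this coding and decoding scheme is given below; the threshold value of 0 and the use of -1 as the reject label are assumptions made for illustration, not choices fixed by the text.

import numpy as np

def encode_targets(labels, n_classes):
    # 1-of-c coding: class k becomes a vector of -1s with +1 in position k.
    T = -np.ones((len(labels), n_classes))
    T[np.arange(len(labels)), labels] = 1.0
    return T

def decode_outputs(Y, threshold=0.0):
    # Accept the class with the largest output only when exactly one output
    # exceeds the threshold; otherwise reject (label -1), which defines a
    # simple reject region as mentioned above.
    above = Y > threshold
    return np.where(above.sum(axis=1) == 1, Y.argmax(axis=1), -1)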
Number of hidden neurons
In practical applications one rarely encounters architectures more complex than a
two-layer network, which, as we have seen, is capable of producing solutions
arbitrarily close to the optimal one. For datasets that are hard to train with two-
layer networks, a three-layer solution can be tried, usually with a low number of
hidden neurons (2 or 3) in the third layer. Concerning the number of hidden
neurons in the first layer, it can be shown (Bishop, 1995) that their role is to
perform a dimensionality reduction, much like the Fisher discriminant
transformation described in section 4.1.4. Their number is therefore expected to
be close to the number of significant eigenvectors of the data covariance
matrix. Bearing this in mind, it must be said that there is no hard-and-fast rule
that will substitute for experimentation.
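As one possible starting point (not a rule given in the text), the number of significant eigenvectors can be estimated from the eigenvalues of the data covariance matrix, for instance by counting how many are needed to retain a chosen fraction of the total variance; the 95% figure below is merely an illustrative choice.

import numpy as np

def suggested_hidden_units(X, variance_fraction=0.95):
    # Eigenvalues of the data covariance matrix, sorted in descending order.
    cov = np.cov(X, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    # Number of eigenvalues needed to retain the chosen fraction of the variance.
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, variance_fraction) + 1)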
Learning parameters
As previously seen, the learning rate controls how large the weight adjustment is in
each iteration. A large learning rate may lead to faster convergence, but may also
cause strong oscillations near the optimal solution, or even divergence. It is normal
to choose values below 0.5, and for non-trivial problems it is advisable to use a