This momentum term, with momentum factor α, tends to speed up the network
convergence while at the same time avoiding oscillations. It acts in the same way
as the mass of a particle falling on a surface in a viscous medium: away from a
minimum, the particle's mass increases its speed along the downward trajectory;
near the minimum, it dampens the oscillations around it. Similarly, the
momentum term increases the effective learning rate in regions of low curvature
and decreases it in high-curvature regions, thereby reducing oscillations in these
regions (for details see Qian, 1999).
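As a concrete illustration, the following is a minimal sketch of the momentum update rule; the names eta (learning rate) and alpha (momentum factor) are our own and do not refer to any particular library's API.

```python
def momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """One weight update with a momentum term.

    w        : current weight vector
    grad     : gradient of the error with respect to w
    velocity : previous weight change, the momentum "memory"
    """
    # New change: a plain gradient step plus a fraction alpha of the
    # previous change. In flat (low-curvature) regions successive
    # gradients point the same way, so the changes accumulate and
    # convergence speeds up; near a minimum they alternate in sign and
    # partially cancel, damping the oscillations.
    velocity = -eta * grad + alpha * velocity
    return w + velocity, velocity
```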
The previous weight-updating formulas assume a pattern-by-pattern operation
mode. It is usually more efficient to compute the errors for all the patterns and
then update the weights once using these total errors. This is the so-called
batch training, already mentioned in section 5.1. An iteration using all of the
available data is called an epoch, and training is conducted by repeating the
weight-updating process for a sufficiently large number of epochs.
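A sketch of the difference between the two operation modes is given below; the names net.backward (the per-pattern error gradient) and net.w (the weight vector) are hypothetical stand-ins for an actual network implementation.

```python
import numpy as np

def train_epoch_pattern(net, patterns, targets, eta):
    """Pattern-by-pattern mode: update the weights after each pattern."""
    for x, t in zip(patterns, targets):
        net.w -= eta * net.backward(x, t)   # gradient for one pattern

def train_epoch_batch(net, patterns, targets, eta):
    """Batch mode: accumulate the gradients over one epoch (all the
    patterns), then apply a single update with the total error gradient."""
    total_grad = np.zeros_like(net.w)
    for x, t in zip(patterns, targets):
        total_grad += net.backward(x, t)
    net.w -= eta * total_grad
```

Training then consists of calling one of these epoch routines repeatedly until the error is sufficiently low.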
5.5.2 Practical aspects
When training multi-layer perceptrons, and other types of neural nets as well,
several practical aspects must be taken into account; these are described next.
Feature and architecture selection
When designing a neural net, one usually has to perform feature selection in the
same way as when designing statistical classifiers. However, the classical search
methods are more difficult or cumbersome to apply in the case of neural nets for
two reasons: for a given architecture, any configuration of features at the network
inputs demands a lengthy training process; for a given configuration of features at
the network inputs, the performance of the network depends on the architecture
used. Therefore, feature set and architecture work together in a coupled way.
Concerning the first issue, we will later present a feature selection method based on
genetic algorithms, which is fast and often produces quite good results.
Regarding the second issue, one may implement searching schemes for the "best"
solution in a domain of interesting architectures. This is the approach implemented
in Statistica under the name of Intelligent Problem Solver (IPS): once we have
specified the type of network, the range of features and some constraints on the
architecture, such as the number of hidden nodes, the IPS will automatically search
for the "best" solutions.
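As an illustration of such a searching scheme, the sketch below performs a plain grid search over the number of hidden nodes, training one network per candidate architecture and keeping the one with the best cross-validated score. This is not Statistica's IPS; it assumes the scikit-learn library is available and stands in for any tool of this kind.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def search_architecture(X, y, hidden_sizes=(2, 4, 8, 16)):
    """Train one MLP per candidate number of hidden nodes and return
    the size achieving the best 5-fold cross-validated accuracy."""
    best_size, best_score = None, -1.0
    for h in hidden_sizes:
        net = MLPClassifier(hidden_layer_sizes=(h,), max_iter=1000)
        score = cross_val_score(net, X, y, cv=5).mean()
        if score > best_score:
            best_size, best_score = h, score
    return best_size, best_score
```

Note that each candidate architecture requires a full training run, which is precisely why coupling the architecture search with a feature search quickly becomes expensive.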