Because of the saturation effect of the most popular activation functions, namely the sigmoidal functions, it is advisable to scale all inputs to a convenient range; otherwise the inputs with higher values will drive the training process, masking the contribution of lower-valued inputs. Also, the choice of appropriate initial weights depends on the interval range of the inputs; with weights far away from that range the convergence time can increase significantly.

In general the pre-processing of the inputs consists of performing a linear feature scaling in such a way that all of them occupy the [0, 1] or [-1, 1] interval.
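
As a minimal illustration (the function name and the sample data below are assumptions, not taken from the text), this linear feature scaling to the [-1, 1] interval can be sketched as:

    import numpy as np

    def scale_features(X, lo=-1.0, hi=1.0):
        # Linearly rescale each column (feature) of X to [lo, hi].
        x_min = X.min(axis=0)
        x_max = X.max(axis=0)
        span = np.where(x_max > x_min, x_max - x_min, 1.0)  # guard constant features
        return lo + (hi - lo) * (X - x_min) / span

    X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 300.0]])
    print(scale_features(X))  # each feature now spans [-1, 1]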




Post-processing operations are related to the application of neural nets for classification purposes. In this situation it is usually convenient to code the outputs as nominal variables, whose values are class labels.

In two-class classification problems one may use a single output, coded as a two-state variable (e.g. -1, +1). When there are more than two classes it is more appropriate to have one output for each class, with one nominal value representing the class decision. For instance, for three classes one would have three outputs with nominal values (+1, -1) corresponding to the class decisions ω1 = (+1, -1, -1), ω2 = (-1, +1, -1) and ω3 = (-1, -1, +1). Network output values can be converted to the proper nominal value using thresholds. By setting up appropriate threshold values one can also define reject regions, as done in section 4.2.3. Output thresholding is easily performed using step functions.
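
As a hedged sketch of this decoding step (the function name decode_outputs and the threshold of 0.0 are illustrative assumptions, not from the text), each of the one-per-class outputs is thresholded with a step function, and an ambiguous result falls into the reject region:

    import numpy as np

    def decode_outputs(outputs, threshold=0.0):
        # Step-function thresholding: outputs above threshold map to +1.
        fired = np.where(outputs > threshold)[0]
        if len(fired) == 1:          # exactly one class fired: accept it
            return int(fired[0])
        return -1                    # none or several fired: reject region

    print(decode_outputs(np.array([0.9, -0.8, -0.7])))  # -> 0 (class ω1)
    print(decode_outputs(np.array([0.2, 0.1, -0.3])))   # -> -1 (reject)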

Number of hidden neurons

In practical applications one rarely encounters architectures more complex than a two-layer network, which, as we have seen, is capable of producing solutions arbitrarily close to the optimal one. For datasets that are hard to train with two-layer networks, a three-layer solution can be tried, usually with a low number of hidden neurons (2 or 3) in the third layer. Concerning the number of hidden neurons in the first layer, it can be proved (Bishop, 1995) that their role is to perform a dimensional reduction, much in the way of the Fisher discriminant transformation described in section 4.1.4. It is expected, therefore, that their number will be near the number of significant eigenvectors of the data covariance matrix. Bearing this in mind, it must be said that there is no hard and fast rule that will substitute for experimentation.
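
A minimal sketch of this eigenvector-counting heuristic, assuming a simple cumulative-variance cutoff (the 0.95 ratio and the function name are assumptions for illustration, not from the text):

    import numpy as np

    def suggest_hidden_neurons(X, var_ratio=0.95):
        # Count the covariance eigenvalues needed to retain var_ratio
        # of the total data variance; use the count as a starting guess.
        cov = np.cov(X, rowvar=False)            # data covariance matrix
        eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigenvalues, descending
        cum = np.cumsum(eigvals) / eigvals.sum()
        return int(np.searchsorted(cum, var_ratio) + 1)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))
    print(suggest_hidden_neurons(X))  # refine this guess by experimentation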

Learning parameters

As previously seen, the learning rate controls how large the weight adjustment is in each iteration. A large learning rate may lead to faster convergence, but may also cause strong oscillations near the optimal solution, or even divergence. It is normal to choose values below 0.5, and for non-trivial problems it is advisable to use a
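
As a hedged illustration of these dynamics (the quadratic toy error surface E(w) = w^2 and the rates shown are assumptions, not from the text), the gradient-descent update w <- w - eta * dE/dw converges smoothly for a small learning rate eta but oscillates or diverges when eta is too large:

    def descend(eta, w=5.0, steps=20):
        # Gradient descent on the toy error E(w) = w**2 (gradient 2*w).
        for _ in range(steps):
            w = w - eta * 2.0 * w    # weight update: w <- w - eta*dE/dw
        return w

    for eta in (0.1, 0.45, 0.9, 1.1):  # values below 0.5 vs. too large
        print(f"eta={eta}: final w = {descend(eta):.4g}")
    # Small eta converges to w = 0; eta = 0.9 oscillates while converging;
    # eta = 1.1 makes |w| grow at every step (divergence).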