


feature (bias) of value one, as shown in Figure 5.1. We will also use w to denote the whole weight vector, unless we need to refer explicitly to the bias term. This discriminant unit, whose input variables are the features and whose output is the linear function d(x), is also called a linear network.
  When presenting connectionist structures graphically, such as the one in Figure 5.1, we will use an open circle to represent a processing neuron and a black circle to represent a terminal neuron. In the case of a single-output linear network, as shown in Figure 5.1, there is only one processing unit, where the contributions from all inputs are summed.
  In general, we will have available c such functions dk(x) with weight vectors wk,
one for each class; therefore, for each pattern xi we write:

  d_k(x_i) = w_k^T x_i ,   k = 1, ..., c

                            As in section 4.1.2, the class label assigned to an unknown pattern corresponds
                          to the decision function reaching a maximum for that pattern.
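  As a minimal sketch of this decision rule (not code from the text), suppose the c weight vectors are stacked as the rows of a matrix W, with the bias weight placed in the first column so that it multiplies the extra feature of value one; the names W, x and classify are purely illustrative:

    import numpy as np

    def classify(x, W):
        # x: feature vector of length d
        # W: c-by-(d+1) matrix; row k holds the weights of dk(x), bias weight in column 0
        x_ext = np.concatenate(([1.0], x))   # prepend the bias feature of value one
        d = W @ x_ext                        # d[k] = dk(x), one discriminant per class
        return int(np.argmax(d))             # class whose decision function is maximum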
  Imagine now that we wanted to adjust the weights of these linear functions in order to approximate some target outputs tk(x) for each class ωk. We could do this in the following way:
  First, for each feature vector xi, we compute the deviations of each discriminant output from the target values:

  e_{ki} = d_k(x_i) - t_k(x_i)

  Next, these deviations, or approximation errors, are squared and summed in order to obtain a total error, E:

  E = \frac{1}{2} \sum_{i} \sum_{k=1}^{c} \left( d_k(x_i) - t_{ki} \right)^2        (5-2a)

  In this last formula we have simplified the writing by using tki instead of tk(xi). We have also included a factor of one half, whose only purpose is to ease subsequent derivations. Note that equation (5-2a) can be viewed as the total dissimilarity between the output values and the desired target values, measured with a squared Euclidean metric.
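  Continuing the same illustrative sketch, the error energy of (5-2a) could be computed over the whole training set by collecting the n patterns as the rows of a matrix X and the targets tki in an n-by-c matrix T (again, these names and the function error_energy are assumptions, not the text's notation):

    import numpy as np

    def error_energy(W, X, T):
        # W: c-by-(d+1) weight matrix, bias weights in column 0
        # X: n-by-d matrix of feature vectors, one pattern per row
        # T: n-by-c matrix of target outputs tki
        n = X.shape[0]
        X_ext = np.hstack([np.ones((n, 1)), X])   # add the bias feature of value one
        D = X_ext @ W.T                           # D[i, k] = dk(xi)
        return 0.5 * np.sum((D - T) ** 2)         # one-half factor eases later derivations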
  Adding the squares of the deviations, as we did in (5-2a), imposes a stronger penalty on the larger ones. Other formulas for E are also possible, for instance using the absolute values of the deviations instead of their squares. However, the sum-of-squares error has the desirable property of easy differentiation, as well as well-established physical and statistical interpretations. For instance, if our linear network had to approximate voltage values, and the voltage deviations obtained were applied to resistances of the same value, the heat they generated would be proportional to E. It seems appropriate, therefore, to call E the error energy. In