


We are therefore guaranteed that, as we use larger training sets, we will obtain an error estimate that comes arbitrarily close to the true error of the classifier, whatever value it may have. This result applies to the general risk functional (5-59) and its empirical risk estimate, Ped(w), also known as the empirical error. This convergence behaviour with increasing n underlies the principle of empirical error minimization (ERM). We have already applied the ERM principle in the previous chapter, when discussing the application of the Bayes rule for minimum risk.
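
To make this convergence concrete, here is a minimal simulation sketch in Python (not from the text; the two-Gaussian problem, the fixed threshold classifier and all names are illustrative choices). It estimates the error of a fixed classifier on increasingly large samples and compares the estimate with the analytically known true error:

# Sketch: the empirical error of a FIXED classifier converges to its true
# error as the sample size n grows (the convergence property discussed above).
# The two-Gaussian problem and the threshold classifier are illustrative choices.
import math
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Two equiprobable classes: class 0 ~ N(-1, 1), class 1 ~ N(+1, 1)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)
    return x, y

def classify(x):
    """Fixed classifier: decide class 1 when x > 0."""
    return (x > 0).astype(int)

# True error of this classifier: P(N(-1,1) > 0) = P(N(+1,1) < 0) = Phi(-1) ~ 0.1587
true_error = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))

for n in (100, 1_000, 10_000, 100_000):
    x, y = sample(n)
    empirical_error = float(np.mean(classify(x) != y))
    print(f"n = {n:>6d}  empirical error = {empirical_error:.4f}  "
          f"deviation from true error = {abs(empirical_error - true_error):.4f}")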
Usually, we would also like the network (and respective weight vector) that minimizes the empirical error to minimize the true error as well. After all, having an excellent error estimate of a bad classifier is not of much use! In order to achieve this, the following more restrictive condition, known as uniform convergence of the errors, must apply:

  \lim_{n \to \infty} P\left( \sup_{\mathbf{w}} \left| P_e(\mathbf{w}) - P_{ed}(\mathbf{w}) \right| > \varepsilon \right) = 0, \quad \text{with } \varepsilon > 0,        (5-61)

where sup(x), the supremum of a family of scalar values x, is its least upper bound: the smallest scalar that is not exceeded by any value of the family. What formula (5-61) tells us is that we want the deviation of the empirical error from the true error to become arbitrarily small, taking the supremum of the deviations over a family of networks. The Vapnik-Chervonenkis dimension allows us to establish the following bound for small values of Pe(w) (Vapnik, 1998):

  P\left( \sup_{\mathbf{w}} \left| P_e(\mathbf{w}) - P_{ed}(\mathbf{w}) \right| > \varepsilon \right) \le 4 \exp\!\left( \left[ \frac{d_{VC}\left( \ln \frac{2n}{d_{VC}} + 1 \right)}{n} - \varepsilon^2 \right] n \right)        (5-62)

Let us denote by 1-α the confidence level that we have that the error of a classifier does not deviate from the optimal value by more than ε. This corresponds to the so-called probably approximately correct (PAC) principle (see e.g. Mitchell, 1997). Then, using formulas (5-62) and (5-61), the following bound holds at the 1-α confidence level:

  P_e(\mathbf{w}) \le P_{ed}(\mathbf{w}) + \sqrt{ \frac{ d_{VC}\left( \ln \frac{2n}{d_{VC}} + 1 \right) - \ln \frac{\alpha}{4} }{ n } }        (5-63)

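To make the bound concrete, the following small numeric sketch in Python assumes the standard Vapnik form of (5-62) and (5-63) given above; the function name vc_confidence_term and all values are illustrative. The confidence term is obtained by equating the right-hand side of (5-62) to α and solving for ε; with n and α fixed it grows with dVC, which is the behaviour analysed next.

# Numeric sketch of the confidence term in (5-63), assuming the standard
# Vapnik form of the bound quoted above; names and values are illustrative.
import math

def vc_confidence_term(n, d_vc, alpha):
    """epsilon(n, d_VC, alpha) = sqrt((d_VC*(ln(2n/d_VC) + 1) - ln(alpha/4)) / n)."""
    return math.sqrt((d_vc * (math.log(2.0 * n / d_vc) + 1.0)
                      - math.log(alpha / 4.0)) / n)

n, alpha = 10_000, 0.05          # fixed design-set size, 95% confidence level
empirical_error = 0.05           # illustrative value of Ped(w)

for d_vc in (10, 50, 100, 500, 1_000):
    eps = vc_confidence_term(n, d_vc, alpha)
    print(f"d_VC = {d_vc:>5d}  confidence term = {eps:.3f}  "
          f"bound: P_e(w) <= {empirical_error + eps:.3f}")
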
In order to study the influence of these terms, let us consider fixed values for α and n and analyse what happens with increasing values of dVC. For an increasing dVC the empirical error will decrease, since the neural network will be better adjusted to the training set, up to the limit of having as many hidden neurons as there are patterns. The second term, however, will increase with dVC, representing a structural risk of non-generalization. It works, therefore, as a regularization term that penalizes the network complexity. In general, for any loss function, we will