We are therefore guaranteed that, as we use larger training sets, we will obtain
an error estimate that comes arbitrarily close to the true error of the classifier,
whatever value it may have. This result applies to the general risk functional (5-59)
and its empirical risk estimate, Ped(w), also known as the empirical error. This convergence behaviour with increasing n underlies the principle of empirical risk (or error) minimization (ERM): choose the classifier that minimizes the empirical error. We have already applied the ERM principle in the previous
chapter, when discussing the application of the Bayes rule for minimum risk.
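To make this convergence concrete, the following Python sketch (not from the book; the two Gaussian classes and the fixed threshold classifier are illustrative assumptions) estimates the error of one fixed classifier on samples of increasing size n and compares the estimate with the analytically known true error:

```python
# Minimal sketch: the empirical error Ped of a *fixed* classifier approaches
# its true error Pe as the number of patterns n grows. Illustrative setting:
# two 1-D Gaussian classes, N(0,1) and N(2,1), equal priors, and a classifier
# that assigns class 1 whenever x > 1; its true error is Phi(-1) ~= 0.1587.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
pe_true = 0.5 * (1.0 + erf(-1.0 / sqrt(2.0)))        # Phi(-1)

def empirical_error(n):
    """Empirical error of the threshold classifier on n random patterns."""
    labels = rng.integers(0, 2, size=n)              # 0 -> N(0,1), 1 -> N(2,1)
    x = rng.normal(loc=2.0 * labels, scale=1.0)
    predictions = (x > 1.0).astype(int)
    return float(np.mean(predictions != labels))

for n in (50, 500, 5000, 50000, 500000):
    ped = empirical_error(n)
    print(f"n = {n:6d}   Ped = {ped:.4f}   |Ped - Pe| = {abs(ped - pe_true):.4f}")
```

As n grows, the printed deviation shrinks, in agreement with the convergence stated above.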
Usually, we would like the network (and respective weight vector) that minimizes the empirical error to also minimize the true error. After all, having an
excellent estimate of a bad classifier is not of much use! In order to achieve that,
the following more restrictive condition, known as uniform convergence of the
errors, must apply:

\lim_{n \to \infty} P\left( \sup_{\mathbf{w}} \left| P_e(\mathbf{w}) - P_{ed}(\mathbf{w}) \right| > \varepsilon \right) = 0, \quad \text{for every } \varepsilon > 0, \qquad (5-61)

where sup(x) (the supremum of x) of a family of scalar values x is the smallest scalar larger than or equal to every value of the family (its least upper bound). What formula (5-61) tells us is that we want to
obtain an arbitrarily small deviation of the empirical error from the true error, using
the supremum of the deviations for a family of networks. The Vapnik-Chervonenkis dimension, d_VC, allows us to establish the following bound, valid for small values of Pe(w) (Vapnik, 1998):

P\left( \sup_{\mathbf{w}} \left( P_e(\mathbf{w}) - P_{ed}(\mathbf{w}) \right) > \varepsilon \right) < 4 \exp\left\{ \left( \frac{d_{VC}\left( \ln \frac{2n}{d_{VC}} + 1 \right)}{n} - \varepsilon^{2} \right) n \right\} \qquad (5-62)
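Formula (5-62) bounds the probability of a large supremum deviation. The sketch below (illustrative, not the book's code; the grid of threshold classifiers plays the role of the family of networks) computes that supremum deviation, sup_w |Pe(w) - Ped(w)|, over a small family of classifiers whose true errors are known analytically, and shows it shrinking as n grows, as required by (5-61):

```python
# Minimal sketch: uniform convergence of the empirical error over a *family*
# of classifiers. The family is a grid of thresholds w (assign class 1 when
# x > w) for two Gaussian classes N(0,1) and N(2,1) with equal priors, so the
# true error Pe(w) of every member is known in closed form.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
thresholds = np.linspace(-2.0, 4.0, 121)             # the family of classifiers
pe_true = 0.5 * (1.0 - norm.cdf(thresholds)) + 0.5 * norm.cdf(thresholds - 2.0)

def sup_deviation(n):
    """sup over the family of |Pe(w) - Ped(w)| for one sample of size n."""
    labels = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2.0 * labels, scale=1.0)
    # empirical error of every threshold at once: (patterns x thresholds)
    predictions = (x[:, None] > thresholds[None, :]).astype(int)
    ped = np.mean(predictions != labels[:, None], axis=0)
    return float(np.max(np.abs(ped - pe_true)))

for n in (100, 1000, 10000, 100000):
    print(f"n = {n:6d}   sup_w |Pe(w) - Ped(w)| = {sup_deviation(n):.4f}")
```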
Let us denote by 1-α the confidence level we have that the error of a classifier does not deviate from the optimal value by more than ε. This corresponds to the so-called probably approximately correct (PAC) principle (see e.g. Mitchell, 1997). Then, using formulas (5-62) and (5-61), the following bound holds at the 1-α confidence level:

P_e(\mathbf{w}) \le P_{ed}(\mathbf{w}) + \sqrt{ \frac{ d_{VC}\left( \ln \frac{2n}{d_{VC}} + 1 \right) - \ln \frac{\alpha}{4} }{ n } } \qquad (5-63)

This bound is obtained by equating the right-hand side of (5-62) to α and solving for ε.
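As a quick numerical reading of this bound, the following sketch (illustrative; it uses the form of (5-63) as written above) evaluates the second, confidence term for a fixed sample size n and confidence level 1-α, and for several values of d_VC:

```python
# Minimal sketch: the confidence term of bound (5-63),
#   sqrt((d_VC (ln(2n/d_VC) + 1) - ln(alpha/4)) / n),
# evaluated for fixed n and alpha and increasing VC dimension d_VC.
import numpy as np

def vc_confidence(n, d_vc, alpha):
    """Second term of (5-63) for sample size n, VC dimension d_vc, level alpha."""
    return np.sqrt((d_vc * (np.log(2.0 * n / d_vc) + 1.0) - np.log(alpha / 4.0)) / n)

n, alpha = 1000, 0.05
for d_vc in (2, 5, 10, 20, 50, 100):
    print(f"d_VC = {d_vc:3d}   confidence term = {vc_confidence(n, d_vc, alpha):.3f}")
```

The term grows steadily with d_VC, which is the behaviour analysed next.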
In order to study the influence of these terms, let us consider fixed values for α and n and analyse what happens with increasing values of d_VC. For an increasing d_VC the empirical error will decrease, since the neural network will be better adjusted to the training set, up to the limit of having as many hidden neurons as there are patterns. The second term, however, will increase with d_VC, representing a structural risk of non-generalization. It works, therefore, as a regularization term that penalizes the network complexity. In general, for any loss function, we will