feature (bias) of value one, as shown in Figure 5.1. We will also use w to denote
the whole weight vector, unless we need to refer explicitly to the bias term. This
discriminant unit, whose input variables are the features and whose output is the
linear function d(x), is also called a linear network.
When presenting connectionist structures graphically, such as the one in Figure
5.1, we will use an open circle to represent a processing neuron and a black circle
to represent a terminal neuron. In the case of a single-output linear network, as
shown in Figure 5.1, there is only one processing unit, where the contributions from
all inputs are summed up.
In general, we will have available c such functions d_k(x), with weight vectors w_k,
one for each class; therefore, for each pattern x_i we write:

d_k(x_i) = w_k^T x_i ,   k = 1, ..., c .
As in section 4.1.2, an unknown pattern is assigned the label of the class whose
decision function reaches a maximum for that pattern.
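As an illustration only, the following Python sketch computes the c linear discriminants and assigns each pattern to the class of maximum output. The function and variable names are our own, and the bias is absorbed as an extra feature of value one, as above:

    import numpy as np

    def linear_discriminants(X, W):
        # d_k(x_i) = w_k . x_i for every pattern and class.
        # X: (n, d) patterns, already augmented with a constant
        #    feature of value one that absorbs the bias term.
        # W: (c, d) matrix whose rows are the weight vectors w_k.
        return X @ W.T                      # (n, c) discriminant values

    def classify(X, W):
        # Assign each pattern the class whose discriminant is maximum
        # (indices are 0-based here, unlike the text's k = 1, ..., c).
        return np.argmax(linear_discriminants(X, W), axis=1)

    # Toy usage: three 2-feature patterns, two classes.
    X = np.array([[0.5, 1.0], [2.0, 0.1], [1.0, 1.0]])
    X = np.hstack([X, np.ones((X.shape[0], 1))])   # append the bias feature
    W = np.array([[ 1.0, -1.0, 0.0],               # w_1 (last entry: bias)
                  [-0.5,  2.0, 0.1]])              # w_2
    print(classify(X, W))                          # -> [1 0 1]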
Imagine now that we wanted to adjust the weights of these linear functions in
order to approximate some target outputs t_k(x) for each class ω_k. We could do this
in the following way:
First, for each feature vector x_i, we compute the deviations of each discriminant
output from the target values:

e_ki = d_k(x_i) - t_k(x_i) .
Next, these deviations, or approximation errors, are squared and summed in
order to obtain a total error, E:

E = (1/2) Σ_{i=1..n} Σ_{k=1..c} (d_k(x_i) - t_ki)^2 .   (5-2a)
In this last formula we simplified the writing by using t_ki instead of t_k(x_i). We
have also included a one-half factor, whose only relevance is to ease subsequent
derivations. Note that equation (5-2a) can be viewed as the total dissimilarity
between the output values and the desired target values, measured with a squared
Euclidean metric.
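A minimal sketch of this computation, reusing the hypothetical linear_discriminants function, X, and W from the sketch above and assuming one-of-c target vectors, might read:

    def sum_of_squares_error(X, W, T):
        # Error energy E = 1/2 * sum_i sum_k (d_k(x_i) - t_ki)**2.
        # T: (n, c) array of target outputs t_ki, one row per pattern.
        D = linear_discriminants(X, W)      # (n, c) discriminant outputs
        return 0.5 * np.sum((D - T) ** 2)   # squared deviations, summed

    # Toy targets for the three patterns above (true class coded as 1).
    T = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
    print(sum_of_squares_error(X, W, T))    # -> 1.31625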
Adding the squares of the deviations, as we did in (5-2a), imposes a stronger
penalty on the larger ones. Other formulas for E are also possible, for instance
using the absolute values of the deviations instead of their squares. However, the
sum-of-squares error has the desirable properties of being easy to differentiate
and of having well-established physical and statistical interpretations. For instance, if our linear
network had to approximate voltage values, and the voltage deviations obtained
were applied to resistances of the same value, the heat generated by them would be
proportional to E. It seems, therefore, appropriate to call E the error energy. In