2.3 The Covariance Matrix

transformation of x with a diagonal matrix, amounting to a multiplication of each
feature $x_i$ by some quantity $a_i$, would now be scaled by a new variance $a_i^2 s_i^2$,
therefore preserving $\|y\|$. However, this simple scaling method would fail to
preserve distances for a general linear transformation, such as the one illustrated
in Figure 2.11.
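The contrast between the two kinds of transformation can be checked numerically. The following Python sketch assumes NumPy and uses synthetic data; the matrices `A_diag` and `A_gen` and the helper `standardized_norms` are hypothetical names chosen for this illustration. It compares the norms of the variance-scaled patterns before and after each transformation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 patterns, 2 features with different spreads.
X = rng.normal(size=(200, 2)) * [1.0, 5.0]

def standardized_norms(data):
    """Norm of each pattern after dividing every feature by its standard deviation."""
    s = data.std(axis=0, ddof=1)
    return np.linalg.norm(data / s, axis=1)

A_diag = np.diag([3.0, 0.2])                  # hypothetical per-feature scaling a_i
A_gen = np.array([[2.0, 1.0], [0.5, 1.5]])    # hypothetical general linear transformation

base = standardized_norms(X)
diag_ok = np.allclose(base, standardized_norms(X @ A_diag.T))
gen_ok = np.allclose(base, standardized_norms(X @ A_gen.T))
print("norms preserved by diagonal scaling:", diag_ok)
print("norms preserved by general transformation:", gen_ok)
```

Under these assumptions the norms survive the diagonal scaling but not the general transformation, which is what motivates the covariance-based measure introduced next.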
In order to have a distance measure that is invariant to linear
transformations, we first need to consider the notion of covariance, an extension of
the more familiar notion of variance, which measures the tendency of two features
$x_i$ and $x_j$ to vary in the same direction. The covariance between features $x_i$
and $x_j$ is estimated as follows for n patterns:

$$c_{ij} = \frac{1}{n-1} \sum_{k=1}^{n} (x_{k,i} - m_i)(x_{k,j} - m_j) \qquad (2\text{-}15)$$

Notice that covariances are symmetric, $c_{ij} = c_{ji}$, and that $c_{ii}$ is in fact the usual
estimate of the variance of $x_i$.
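A minimal Python sketch of this estimate, assuming NumPy, the usual 1/(n-1) normalization, and a small made-up data set; the function name `covariance` is hypothetical. It also checks the symmetry and the variance property just mentioned.

```python
import numpy as np

def covariance(X, i, j):
    """Estimate c_ij from an (n, d) array of n patterns (1/(n-1) normalization assumed)."""
    n = X.shape[0]
    m = X.mean(axis=0)                          # feature means m_i
    v = (X[:, i] - m[i]) * (X[:, j] - m[j])     # one product per pattern
    return v.sum() / (n - 1)

# Made-up data: 4 patterns, 2 features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]])

c01 = covariance(X, 0, 1)
c10 = covariance(X, 1, 0)
c00 = covariance(X, 0, 0)
print(c01, c10, c00)
```

Here `c01` and `c10` coincide, and `c00` agrees with `np.var(X[:, 0], ddof=1)`.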
The covariance is related to the well-known Pearson correlation, estimated as:

$$r_{ij} = \frac{c_{ij}}{s_i s_j}$$

Therefore, the correlation can be interpreted as a standardized covariance.
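The standardization can be illustrated with a short Python sketch, assuming NumPy and a small made-up data set: each covariance is divided by the product of the two standard deviations, and the result is compared with NumPy's own correlation estimate.

```python
import numpy as np

# Made-up data: 4 patterns, 2 features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 4.0]])

C = np.cov(X, rowvar=False)     # covariances, 1/(n-1) normalization
s = np.sqrt(np.diag(C))         # standard deviations s_i
R = C / np.outer(s, s)          # r_ij = c_ij / (s_i * s_j)

print(R)
```

The diagonal of `R` is 1 by construction, and the off-diagonal entries match `np.corrcoef(X, rowvar=False)`.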
Looking at Figure 2.9, one may rightly guess that circular clusters have no
privileged direction of variance, i.e., they have equal variance along any direction.
Consider now the products $v_{ij} = (x_{k,i} - m_i)(x_{k,j} - m_j)$. For any feature vector
yielding a given $v_{ij}$ value, it is a simple matter, for a sufficiently large population, to
find another, orthogonal, feature vector yielding $-v_{ij}$. The $v_{ij}$ products therefore
cancel out (the variation along one direction is uncorrelated with the variation in
any other direction), resulting in a covariance matrix that, apart from a scale factor, is the
unit matrix, C = I.
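This cancellation can be observed on simulated data. The sketch below assumes NumPy and draws a synthetic circular cluster from an isotropic normal distribution, which is only one possible model of a circular cluster; the spread of 2.0 and the sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic circular cluster: same spread in every direction.
X = rng.normal(scale=2.0, size=(100_000, 2))

C = np.cov(X, rowvar=False)
C_scaled = C / C[0, 0]          # remove the scale factor

print(C)
print(C_scaled)
```

With a large sample, `C_scaled` is close to the unit matrix: the diagonal entries are nearly equal and the cross covariance is nearly zero.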
Let us now turn to the elliptic clusters shown in Figure 2.11. For such ellipses,
with the major axis subtending a positive angle measured in an anti-clockwise
direction from the abscissas, one will find more and higher positive $v_{ij}$ values along
directions around the major axis than negative $v_{ij}$ values along directions around
the minor axis, therefore resulting in a positive cross covariance $c_{12} = c_{21}$. If the
major axis subtends a negative angle the covariance is negative. The higher the
covariance, the "thinner" the ellipse (feature vectors concentrated around the major
axis). In the cork stoppers example of Figure 2.13, the correlation (and therefore
also the covariance) between N and PRT10 is high: 0.94.
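The sign and size of the cross covariance can be explored with synthetic ellipses. The Python sketch below assumes NumPy; the helper `elliptic_cluster`, the 30 degree angle and the axis spreads are hypothetical choices for the illustration and are unrelated to the cork stoppers data.

```python
import numpy as np

def elliptic_cluster(angle_deg, minor=1.0, n=50_000, seed=2):
    """Synthetic 2-D cluster: spread 3 along the major axis, `minor` along the
    minor axis, rotated anti-clockwise by angle_deg from the abscissa."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(size=(n, 2)) * [3.0, minor]
    t = np.deg2rad(angle_deg)
    R = np.array([[np.cos(t), -np.sin(t)],
                  [np.sin(t),  np.cos(t)]])
    return Z @ R.T              # rotate every pattern (rows are patterns)

c_pos = np.cov(elliptic_cluster(+30.0), rowvar=False)[0, 1]
c_neg = np.cov(elliptic_cluster(-30.0), rowvar=False)[0, 1]
r_wide = np.corrcoef(elliptic_cluster(30.0, minor=1.0), rowvar=False)[0, 1]
r_thin = np.corrcoef(elliptic_cluster(30.0, minor=0.3), rowvar=False)[0, 1]

print(c_pos, c_neg)
print(r_wide, r_thin)
```

In this setting the positive angle gives a positive cross covariance, the negative angle a negative one, and the thinner ellipse shows the higher correlation.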
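Given a set of n patterns we can compute all the covariances using formula
(2-15), and then establish a symmetric covariance matrix, which for d features is:

$$C = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1d} \\ c_{21} & c_{22} & \cdots & c_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ c_{d1} & c_{d2} & \cdots & c_{dd} \end{bmatrix}$$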
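All the covariances can be obtained at once from the mean-centred data. The following Python sketch assumes NumPy, the 1/(n-1) normalization, and a made-up data set with three features; the variable names are illustrative only.

```python
import numpy as np

# Made-up data: 5 patterns, 3 features.
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 5.0, 1.0],
              [4.0, 4.0, 2.5],
              [5.0, 6.0, 2.0]])

n = X.shape[0]
Xc = X - X.mean(axis=0)         # subtract the feature means m_i
C = Xc.T @ Xc / (n - 1)         # entry (i, j) is the covariance c_ij

print(C)
```

The result is a symmetric d x d matrix with the feature variances on the diagonal, and it agrees with `np.cov(X, rowvar=False)`.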
