Suppose that we apply a linear transformation with the transpose of Z, u = Z'y, to the vectors y with covariance C. Then, as shown in Appendix C, the new covariance matrix will be diagonal (uncorrelated features), having the squares of the eigenvalues as the new variances and preserving the Mahalanobis distances. This orthonormal transformation is also called the Karhunen-Loève transformation.
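The following short NumPy sketch illustrates this result; the data values and variable names are illustrative assumptions for demonstration, not taken from the book:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative correlated 2-D data with covariance C (not the book's dataset).
y = rng.multivariate_normal(mean=[0, 0], cov=[[4.0, 1.5], [1.5, 1.0]], size=2000)

C = np.cov(y, rowvar=False)        # sample covariance matrix
eigvals, Z = np.linalg.eigh(C)     # columns of Z are orthonormal eigenvectors of C

u = y @ Z                          # u = Z'y applied to every feature vector y

C_u = np.cov(u, rowvar=False)      # covariance after the transformation
print(np.round(C_u, 3))            # approximately diagonal, with eigvals on the diagonal

# Mahalanobis distances are preserved by the transformation.
d_y = np.einsum('ij,jk,ik->i', y, np.linalg.inv(C), y)
d_u = np.einsum('ij,jk,ik->i', u, np.linalg.inv(C_u), u)
print(np.allclose(d_y, d_u))       # True (up to numerical error)

The covariance of u comes out diagonal with the eigenvalues of C as variances, and the Mahalanobis distances computed before and after the transformation coincide.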
Notice that we have been using the eigenvectors computed from the original linear transformation matrix A. Usually we do not know this matrix; instead, we compute the eigenvectors of C itself, which are the same. The corresponding eigenvalues of C are, however, the squares of the ones computed from A, and represent the variances along the eigenvectors or, equivalently, the eigenvalues of A represent the standard deviations, as indicated in Figure 2.15. It is customary to sort the eigenvectors by decreasing eigenvalue, the first one corresponding to the direction of maximum variance, as shown in Figure 2.15, the second one to the direction of maximum remaining variance, and so on, until the last one (λ2 in Figure 2.15), which represents only a residual variance.
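A brief sketch of this sorting step, using an illustrative covariance matrix (the numerical values are assumptions for demonstration only):

import numpy as np

# Illustrative 2-D covariance matrix.
C = np.array([[4.0, 1.5], [1.5, 1.0]])
eigvals, Z = np.linalg.eigh(C)

# Sort eigenvalues (and the matching eigenvectors) by decreasing variance.
order = np.argsort(eigvals)[::-1]
eigvals_sorted, Z_sorted = eigvals[order], Z[:, order]

# The eigenvalues of C are the variances along the eigenvectors;
# their square roots are the standard deviations along those directions.
print(eigvals_sorted)
print(np.sqrt(eigvals_sorted))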
Appendix C also includes the proof of the positive definiteness of the covariance matrices, among other equally interesting results.
2.4 Principal Components
As mentioned in the previous chapter, the initial choice of features in PR problems is mostly guided by common-sense ideas of pattern discrimination. Therefore, it is not unusual to start with a large set of features that may exhibit high correlations among themselves and whose contributions to pattern discrimination may vary substantially. Large feature sets are inconvenient for a number of reasons, an obvious one being the computational burden. Less obvious and more compelling reasons will be explained later. A common task in PR problems is therefore to perform some type of feature selection. In the initial phase of a PR project, such selection aims either to discard features whose contribution is insignificant, as will be described in section 2.5, or to perform some kind of dimensionality reduction by using an alternative, smaller set of features derived from the initial ones.
Principal components analysis is a method commonly used for data reduction
purposes. It is based on the idea of performing an orthonormal transformation as
described in the previous section, retaining only significant eigenvectors. As
explained in the section on orthonormal transformation, each eigenvector is
associated with a variance represented by the corresponding eigenvalue. Each
eigenvector corresponding to an eigenvalue that represents a significant variance of
the whole dataset is called a principal component of the data. For instance, in the
example portrayed in Figure 2.15 the first eigenvector represents λ1²/(λ1² + λ2²) = 98% of the total variance; in short, z1 alone contains practically all the information about the data.
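A minimal sketch of this variance accounting and the resulting data reduction, again with an illustrative covariance matrix rather than the values of Figure 2.15:

import numpy as np

# Illustrative 2-D covariance matrix (not the book's example).
C = np.array([[4.0, 1.5], [1.5, 1.0]])
eigvals, Z = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, Z = eigvals[order], Z[:, order]

# Fraction of the total variance carried by each principal component.
explained = eigvals / eigvals.sum()
print(np.round(explained, 3))

# Dimensionality reduction: keep only the first k principal components.
k = 1
rng = np.random.default_rng(0)
y = rng.multivariate_normal([0, 0], C, size=500)   # simulated feature vectors
u_reduced = (y - y.mean(axis=0)) @ Z[:, :k]        # projection onto the retained eigenvectors
print(u_reduced.shape)                             # (500, 1)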
Let us see how this works in a real situation. Figure 2.16 shows the sorted list of the eigenvalues (classes ω1, ω2) of the cork stoppers data, computed with Statistica. The ninth eigenvalue, for instance, is responsible for about 0.01% of the total