Page 220 - MATLAB Recipes for Earth Sciences


9 Multivariate Statistics

thogonal coordinate system, where the first axis passes through the long axis
of the data scatter and the new origin is the bivariate mean. This new
reference frame has the advantage that the first axis can be used to describe
most of the variance, while the second axis contributes only a little. Prior
to the transformation, two axes were needed to describe the data set. It
is therefore possible to reduce the data dimension by dropping the second
axis without losing much information, as shown in Figure 9.1.
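
The effect of this rotation can be tried out on a small synthetic example.
The following is a minimal sketch, not the book's own code: the variable
names, the random seed and the noise level are illustrative assumptions, and
the rotated axes are taken from the eigenvectors of the covariance matrix
using base MATLAB functions only.

```matlab
rng(0)                                   % arbitrary seed, for reproducibility
n  = 200;
x1 = randn(n,1);
x2 = 0.8*x1 + 0.2*randn(n,1);            % assumed: strongly correlated with x1
data = [x1 x2];

% Move the origin to the bivariate mean
centered = bsxfun(@minus, data, mean(data));

% Eigenvectors of the covariance matrix define the rotated axes
[V, D] = eig(cov(centered));
[lambda, order] = sort(diag(D), 'descend');
V = V(:, order);                         % first column = long axis of the scatter

% Coordinates of the samples in the rotated reference frame
newdata = centered * V;

% Fraction of the total variance carried by each new axis
explained = lambda / sum(lambda);

% Drop the second axis and map the first axis back to the original frame
approx = newdata(:,1) * V(:,1)';
relerr = norm(centered - approx, 'fro')^2 / norm(centered, 'fro')^2;
```

With these settings the first axis should carry well over 90 % of the
variance, and relerr, the relative squared error of the one-axis
approximation, equals the variance fraction of the dropped axis.
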
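This is now expanded to an arbitrary number of variables and samples.
Suppose a data set of measurements of p parameters on n samples is stored in
an n-by-p array:

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
$$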

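The columns of the array represent the p variables, the rows represent the n
samples. After rotating the axes and moving the origin, the new coordinates
can be computed by

$$
\begin{aligned}
Y_1 &= a_{11} X_1 + a_{12} X_2 + \cdots + a_{1p} X_p \\
Y_2 &= a_{21} X_1 + a_{22} X_2 + \cdots + a_{2p} X_p \\
    &\;\;\vdots \\
Y_p &= a_{p1} X_1 + a_{p2} X_2 + \cdots + a_{pp} X_p
\end{aligned}
$$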
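
The computation of the new coordinates as linear combinations of the original
variables can be sketched as a single matrix product. This is an illustrative
sketch under stated assumptions: the data and the mixing matrix are made up,
the names Xc, A and Y are hypothetical, and the loads are assumed to be the
eigenvectors of the covariance matrix, sorted by decreasing eigenvalue.

```matlab
rng(1)                                    % arbitrary seed, for reproducibility
n = 100;  p = 4;
M = [2 1 0 0; 0 1 1 0; 0 0 1 1; 0 0 0 0.5];   % assumed mixing matrix
X = randn(n,p) * M;                       % n samples (rows), p correlated variables

Xc = bsxfun(@minus, X, mean(X));          % move the origin to the multivariate mean

[V, D] = eig(cov(Xc));
[lambda, order] = sort(diag(D), 'descend');
A = V(:, order)';                         % row i holds a_i1 ... a_ip

Y = Xc * A';                              % column i is Y_i = a_i1*X_1 + ... + a_ip*X_p

% The same first coordinate, written out as an explicit weighted sum
Y1_manual = zeros(n,1);
for j = 1:p
    Y1_manual = Y1_manual + A(1,j) * Xc(:,j);
end

C = cov(Y);                               % off-diagonal entries vanish up to rounding
```

The loop and the matrix product give the same first coordinate, and the
covariance matrix of the new coordinates is diagonal, so the rotated
variables are uncorrelated.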

The first principal component $PC_1$, denoted by $Y_1$, contains the greatest
variance, $PC_2$ the second highest variance, and so forth. All PCs together
contain the full variance of the data set. The variance is concentrated in the
first few PCs, which explain most of the information content of the data set.
The last PCs are generally ignored to reduce the data dimension. The factors
$a_{ij}$ in the above equations are the principal component loads. The values
of these factors represent the relative contribution of the original variables
to the new PCs. If the load $a_{ij}$ of a variable $X_j$ in $PC_i$ is close to
zero, the influence of this variable is low. A high positive or negative
$a_{ij}$ suggests a strong contribution of the variable $X_j$. The new values
of the variables computed from the linear combinations of the original
variables weighted by the loads are called the principal component scores.
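
How loads, scores and variances relate can be checked on a data set whose
structure is known in advance. The sketch below is an assumption-laden
illustration rather than the book's example: the three variables, the seed
and all names are hypothetical, and only base MATLAB functions are used, so
it runs without the Statistics Toolbox.

```matlab
rng(2)                                    % arbitrary seed, for reproducibility
n  = 500;
s  = randn(n,1);                          % hypothetical common signal
x1 =  s + 0.1*randn(n,1);                 % X_1 follows the signal
x2 = -s + 0.1*randn(n,1);                 % X_2 follows it with opposite sign
x3 = 0.1*randn(n,1);                      % X_3 is unrelated, low-variance noise
X  = [x1 x2 x3];

Xc = bsxfun(@minus, X, mean(X));
[V, D] = eig(cov(Xc));
[lambda, order] = sort(diag(D), 'descend');
loads  = V(:, order);                     % column i holds the loads of PC_i
scores = Xc * loads;                      % column i holds the scores of PC_i

explained  = 100 * lambda / sum(lambda);  % percent of total variance per PC
cumulative = cumsum(explained);           % reaches 100 once all PCs are included

disp(loads(:,1)')                         % loads of X_1, X_2, X_3 in PC_1
disp(explained')
```

In PC_1 the loads of X_1 and X_2 should have similar magnitude and opposite
signs, while the load of X_3 should be close to zero. The overall sign of
each column of loads is arbitrary, so only relative signs are meaningful.
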
In the following, a synthetic data set is used to illustrate the use of the
function princomp contained in the Statistics Toolbox. Our data set contains the
