Page 220 - MATLAB Recipes for Earth Sciences

216                                              9 Multivariate Statistics


            thogonal coordinate system, where the first axis passes through the long axis
            of the data scatter and the new origin is the bivariate mean. This new
            reference frame has the advantage that the first axis can be used to
            describe most of the variance, while the second axis contributes only a
            little. Prior to the transformation, two axes were needed to describe the
            data set. It is therefore possible to reduce the data dimension by dropping
            the second axis without losing much information, as shown in Figure 9.1.
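            The two-variable case described above can be sketched in a few lines. This
            is only an illustration under stated assumptions: the data are synthetic,
            the slope and noise level are made-up values, and the names data, centered,
            V, lambda and newdata are hypothetical. No random seed is set, so the
            numbers vary slightly between runs.

```matlab
% Synthetic bivariate data with an elongated scatter (illustrative values,
% not taken from the text).
n = 500;
x = randn(n,1);
data = [x, 0.5*x + 0.2*randn(n,1)];

% Move the origin to the bivariate mean.
centered = data - repmat(mean(data), n, 1);

% The eigenvectors of the covariance matrix give the directions of the
% new orthogonal axes; sort them by decreasing eigenvalue (variance).
[V, D] = eig(cov(centered));
[lambda, idx] = sort(diag(D), 'descend');
V = V(:, idx);

% Rotate the data into the new reference frame.
newdata = centered * V;

% Share of the total variance carried by each new axis.
explained = lambda / sum(lambda);
fprintf('Axis 1: %.1f %%, axis 2: %.1f %% of the variance\n', 100*explained);
```

            With data this strongly correlated, almost all of the variance falls on the
            first new axis, so newdata(:,1) alone is a reasonable one-dimensional
            summary of the scatter.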
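               This is now expanded to an arbitrary number of variables and samples.
            Suppose a data set consists of measurements of p parameters on n samples,
            stored in an n-by-p array.

$$
X = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{pmatrix}
$$
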
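            The columns of the array represent the p variables, the rows represent the n
            samples. After rotating the axes and moving the origin, the new coordinates
            can be computed by

$$
\begin{aligned}
Y_1 &= a_{11} X_1 + a_{12} X_2 + \cdots + a_{1p} X_p \\
Y_2 &= a_{21} X_1 + a_{22} X_2 + \cdots + a_{2p} X_p \\
    &\;\;\vdots \\
Y_p &= a_{p1} X_1 + a_{p2} X_2 + \cdots + a_{pp} X_p
\end{aligned}
$$
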
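            As a minimal sketch of how the new coordinates can be obtained as linear
            combinations of the original variables, the example below uses only base
            functions (cov, eig, sort). The array sizes, the mixing matrix M and the
            variable names Xc, A, Y and Y1 are assumptions made for illustration.
            Taking the factors from the eigenvectors of the covariance matrix is one
            common way of doing this and is not necessarily the procedure used later
            in the text.

```matlab
% Hypothetical n-by-p data array: n samples in rows, p variables in columns.
n = 30;
p = 3;
M = [2 0.5 0.1; 0 1 0.3; 0 0 0.2];    % made-up mixing matrix
X = randn(n,p) * M;

% Move the origin to the multivariate mean.
Xc = X - repmat(mean(X), n, 1);

% Eigenvectors of the covariance matrix, ordered by decreasing variance.
[A, L] = eig(cov(Xc));
[L, order] = sort(diag(L), 'descend');
A = A(:, order);          % A(j,i) is the factor of variable j in component i

% All new coordinates at once: column i of Y is the i-th linear combination.
Y = Xc * A;

% The first linear combination written out term by term for comparison.
Y1 = A(1,1)*Xc(:,1) + A(2,1)*Xc(:,2) + A(3,1)*Xc(:,3);
fprintf('Largest difference: %g\n', max(abs(Y1 - Y(:,1))));
```

            The matrix product Xc * A evaluates all p linear combinations in one step,
            and the explicit sum reproduces its first column up to rounding error.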
            PC_1, denoted by Y_1, contains the greatest variance, PC_2 the second
            highest variance, and so forth. All PCs together contain the full variance
            of the data set. The variance is concentrated in the first few PCs, which
            explain most of the information content of the data set. The last PCs are
            generally ignored to reduce the data dimension. The factors a_ij in the
            above equations are the principal component loads. The values of these
            factors represent the relative contribution of the original variables to
            the new PCs. If the load a_ij of a variable X_j in PC_i is close to zero,
            the influence of this variable is low. A high positive or negative a_ij
            suggests a strong contribution of the variable X_j. The new values of the
            variables computed from the linear combinations of the original variables
            weighted by the loads are called the principal component scores.
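            The roles of loads, scores and explained variance can be made concrete with
            a small hand-computed example. It is a sketch under assumptions: the three
            synthetic variables, the 95 % cut-off and the names loads, scores, latent,
            explained and reduced are chosen here for illustration only. Two variables
            share a common signal f, while the third is pure noise.

```matlab
% Three hypothetical variables: two driven by a common signal f, one noise.
n = 200;
f = randn(n,1);
X = [f + 0.1*randn(n,1), 2*f + 0.1*randn(n,1), 0.1*randn(n,1)];
Xc = X - repmat(mean(X), n, 1);

% Loads (eigenvectors) and variances (eigenvalues), largest variance first.
[loads, latent] = eig(cov(Xc));
[latent, order] = sort(diag(latent), 'descend');
loads = loads(:, order);

% Scores: the original variables weighted by the loads.
scores = Xc * loads;

% Percentage of the total variance contained in each PC.
explained = 100 * latent / sum(latent);
cumulative = cumsum(explained);

% Keep the first k PCs that reach an assumed 95 % threshold, drop the rest.
k = find(cumulative >= 95, 1);
reduced = scores(:, 1:k);

disp('Loads of the three variables in PC_1:')
disp(loads(:,1)')
fprintf('PCs kept: %d of %d\n', k, size(X,2));
```

            In this setup the load of the third variable in PC_1 is close to zero,
            reflecting its low influence, while the second variable has the largest
            absolute load. The sign of each column of loads is arbitrary, so only the
            magnitudes should be compared.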
               In the following, a synthetic data set is used to illustrate the use of the
            function princomp contained in the Statistics Toolbox. Our data set contains the