Page 164 - Statistics and Data Analysis in Geology

Analysis of Multivariate Data

             Tests of significance
             If we are willing to make some assumptions about the nature of the data used in
             the discriminant function, we can test the significance of the separation between
             the two groups. Five basic assumptions about the data are necessary: (a) the
             observations in each group are randomly chosen, (b) the probability of an unknown
             observation belonging to either group is equal, (c) variables are normally distributed
             within each group, (d) the variance-covariance matrices of the groups are equal in
             size, and (e) none of the observations used to calculate the function were
             misclassified. Of these, the most difficult to justify are (b), (c), and (d). Fortunately,
             the discriminant function is not seriously affected by limited departures from
             normality or by limited inequality of variances. Justification of (b) must depend upon
             a priori assessment of the relative abundance of the groups under examination. If
             the assumption of equal abundance seems unjustified, a different assumption may
             be made, which will shift the position of $R_0$. [See Anderson (1984, chapter 6) for
             an extensive discussion of alternative decision rules for discrimination.]
                 The first step in a test of the significance of a discriminant function is to
             measure the separation or distinctness of the two groups. This can be done by
             computing the distance between the centroids, or multivariate means, of the groups. The
             measure of distance is derived directly from univariate statistics. We can obtain a
             measure of the difference between the means of two univariate samples, $\bar{X}_1$ and
             $\bar{X}_2$, by simply subtracting one from the other. However, this difference is expressed
             in the same units as the original observations. If the difference is divided by the
             pooled standard deviation, we obtain a standardized difference in which the
             difference between the means of the two groups is expressed in dimensionless units
             of standard deviation, or z-scores:

                 standardized difference = $\frac{\bar{X}_1 - \bar{X}_2}{s_p}$                    (6.20)
                 When both sides of Equation (6.20) are squared, the denominator is the pooled
             variance of the two samples, $s_p^2$:

                 standardized difference$^2$ = $\frac{(\bar{X}_1 - \bar{X}_2)^2}{s_p^2}$          (6.21)
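
                 The standardized difference and its square can be illustrated with a short
             Python sketch. It assumes the usual definition of the pooled variance, in which
             the two sample variances are weighted by their degrees of freedom; the text does
             not spell this formula out. The function names and the two small samples are
             hypothetical and serve only as an illustration.

```python
import math
from statistics import mean, variance


def pooled_variance(x1, x2):
    """Pooled variance of two samples (assumed definition: sample
    variances weighted by their degrees of freedom, n - 1)."""
    n1, n2 = len(x1), len(x2)
    return ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)


def standardized_difference(x1, x2):
    """Difference between the sample means divided by the pooled
    standard deviation, giving a dimensionless separation."""
    return (mean(x1) - mean(x2)) / math.sqrt(pooled_variance(x1, x2))


# Hypothetical univariate samples, for illustration only.
sample_1 = [4.0, 5.0, 6.0, 7.0, 8.0]
sample_2 = [1.0, 2.0, 3.0, 4.0, 5.0]

d = standardized_difference(sample_1, sample_2)
d_squared = (mean(sample_1) - mean(sample_2)) ** 2 / pooled_variance(sample_1, sample_2)

print(f"standardized difference         = {d:.4f}")
print(f"squared standardized difference = {d_squared:.4f}")
```

             Squaring the first quantity reproduces the second, which is the relationship
             between the two equations above.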

                 Suppose that instead of a single variable, two variables are measured on each
             observation in the two groups. The difference between the bivariate means of the
             two groups can be expressed as the ordinary Euclidean, or straight-line, distance
             between them. Again denoting the two groups as A and B,

                 Euclidean distance = $\sqrt{(\bar{A}_1 - \bar{B}_1)^2 + (\bar{A}_2 - \bar{B}_2)^2}$       (6.22)

                 In general, if m variables are measured on each observation, the straight-line
             distance between the multivariate means of the two groups is

                 Euclidean distance = $\sqrt{\sum_{j=1}^{m} (\bar{A}_j - \bar{B}_j)^2}$                  (6.23)

             The square of the Euclidean distance is $\sum_{j=1}^{m} (\bar{A}_j - \bar{B}_j)^2$; you can
             verify that this is the same as the matrix product,

                 Euclidean distance$^2$ = $\mathbf{D}'\mathbf{D}$                  (6.24)
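
                 The verification can be sketched numerically in Python. The sketch assumes
             that D is the column vector of differences between the group means, one element
             per variable, so that D'D is the sum of the squared elements. The function names
             and the two small groups of observations are hypothetical.

```python
import math


def centroid(group):
    """Multivariate mean of a group; each row is one observation."""
    n = len(group)
    m = len(group[0])
    return [sum(row[j] for row in group) / n for j in range(m)]


def euclidean_distance(a_bar, b_bar):
    """Straight-line distance between two centroids."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(a_bar, b_bar)))


def d_prime_d(a_bar, b_bar):
    """Matrix product D'D, with D assumed to be the vector of
    differences between the group means."""
    d = [a - b for a, b in zip(a_bar, b_bar)]
    return sum(dj * dj for dj in d)


# Hypothetical groups with m = 3 variables, for illustration only.
group_a = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
group_b = [[0.0, 1.0, 1.0], [2.0, 1.0, 3.0]]

a_bar = centroid(group_a)
b_bar = centroid(group_b)

distance = euclidean_distance(a_bar, b_bar)
product = d_prime_d(a_bar, b_bar)

print(f"Euclidean distance         = {distance:.4f}")
print(f"Euclidean distance squared = {distance ** 2:.4f}")
print(f"D'D                        = {product:.4f}")
```

             The last two printed values agree, because multiplying a row vector by its own
             transpose sums the squares of its elements.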
