Page 164 - Statistics and Data Analysis in Geology


Analysis of Multivariate Data

Tests of significance
If we are willing to make some assumptions about the nature of the data used in
the discriminant function, we can test the significance of the separation between
the two groups. Five basic assumptions about the data are necessary: (a) the ob-
servations in each group are randomly chosen, (b) the probability of an unknown
observation belonging to either group is equal, (c) variables are normally distributed
within each group, (d) the variance-covariance matrices of the groups are equal in
size, and (e) none of the observations used to calculate the function were misclas-
sified. Of these, the most difficult to justify are (b), (c), and (d). Fortunately, the
discriminant function is not seriously affected by limited departures from normal-
ity or by limited inequality of variances. Justification of (b) must depend upon
a priori assessment of the relative abundance of the groups under examination. If
the assumption of equal abundance seems unjustified, a different assumption may
be made, which will shift the position of $R_0$. [See Anderson (1984, chapter 6) for
an extensive discussion of alternative decision rules for discrimination.]
The first step in a test of the significance of a discriminant function is to mea-
sure the separation or distinctness of the two groups. This can be done by comput-
ing the distance between the centroids, or multivariate means, of the groups. The
measure of distance is derived directly from univariate statistics. We can obtain a
measure of the difference between the means of two univariate samples, $\bar{X}_1$ and $\bar{X}_2$,
by simply subtracting one from the other. However, this difference is expressed
in the same units as the original observations. If the difference is divided by the
pooled standard deviation, we obtain a standardized difference in which the dif-
ference between the means of the two groups is expressed in dimensionless units
of standard deviation, or z-scores:

$d = \dfrac{\bar{X}_1 - \bar{X}_2}{s_p}$ (6.20)

When both sides of Equation (6.20) are squared, the denominator is the pooled
variance of the two samples, $s_p^2$:

$d^2 = \dfrac{(\bar{X}_1 - \bar{X}_2)^2}{s_p^2}$ (6.21)
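
The standardized difference of Equations (6.20) and (6.21) can be sketched numerically as follows. This is a minimal illustration, not the author's program: it assumes the usual pooled-variance estimate, in which each sample variance is weighted by its degrees of freedom, and the sample values and the function names `pooled_variance` and `standardized_difference` are hypothetical.

```python
import math
import statistics


def pooled_variance(x1, x2):
    """Pooled variance of two samples (assumed formula: each sample
    variance weighted by its degrees of freedom)."""
    n1, n2 = len(x1), len(x2)
    v1 = statistics.variance(x1)  # sample variance, n - 1 denominator
    v2 = statistics.variance(x2)
    return ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)


def standardized_difference(x1, x2):
    """Difference between the sample means in units of pooled standard deviation."""
    diff = statistics.mean(x1) - statistics.mean(x2)
    return diff / math.sqrt(pooled_variance(x1, x2))


# Hypothetical measurements of one variable on two groups.
sample_1 = [2.0, 4.0, 6.0]
sample_2 = [1.0, 2.0, 3.0]

d = standardized_difference(sample_1, sample_2)
d_squared = d ** 2  # squared form: the denominator is now the pooled variance

print(f"pooled variance = {pooled_variance(sample_1, sample_2):.3f}")
print(f"d = {d:.4f}, d^2 = {d_squared:.4f}")
```

Because the difference is divided by a standard deviation, the result is dimensionless, so it does not change if both samples are re-expressed in different measurement units.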

Suppose that instead of a single variable, two variables are measured on each
observation in the two groups. The difference between the bivariate means of the
two groups can be expressed as the ordinary Euclidean, or straight-line, distance
between them. Again denoting the two groups as A and B,

Euclidean distance = $\sqrt{(\bar{A}_1 - \bar{B}_1)^2 + (\bar{A}_2 - \bar{B}_2)^2}$ (6.22)

In general, if m variables are measured on each observation, the straight-line
distance between the multivariate means of the two groups is

Euclidean distance = $\sqrt{\sum_{j=1}^{m} (\bar{A}_j - \bar{B}_j)^2}$ (6.23)

The square of the Euclidean distance is $\sum_{j=1}^{m} (\bar{A}_j - \bar{B}_j)^2$; you can verify that this is
the same as the matrix product,

Euclidean distance$^2$ = $\mathbf{D}'\mathbf{D}$ (6.24)
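
The equivalence in Equation (6.24) can be checked with a short sketch. It assumes that D is the column vector of differences between the group means, one element per variable, so that D'D is the sum of the squared elements. The two small groups and the helper names (`mean_vector`, `difference_vector`, `squared_euclidean_distance`) are hypothetical and serve only as an illustration.

```python
import math


def mean_vector(rows):
    """Multivariate mean (centroid) of a group; each row is one observation."""
    n = len(rows)
    m = len(rows[0])
    return [sum(row[j] for row in rows) / n for j in range(m)]


def difference_vector(group_a, group_b):
    """Vector D of differences between the group means, one entry per variable."""
    mean_a = mean_vector(group_a)
    mean_b = mean_vector(group_b)
    return [a - b for a, b in zip(mean_a, mean_b)]


def squared_euclidean_distance(d_vec):
    """D'D: the product of the row vector D' and the column vector D."""
    return sum(dj * dj for dj in d_vec)


# Hypothetical data: two variables measured on two observations per group.
group_a = [[1.0, 5.0], [3.0, 7.0]]  # centroid (2, 6)
group_b = [[4.0, 1.0], [6.0, 3.0]]  # centroid (5, 2)

D = difference_vector(group_a, group_b)
dist_sq = squared_euclidean_distance(D)
dist = math.sqrt(dist_sq)

# Cross-check against the standard-library straight-line distance.
direct = math.dist(mean_vector(group_a), mean_vector(group_b))
print(D, dist_sq, dist, direct)
```

The same functions work unchanged for any number of variables m, since the sum simply runs over every element of D.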
