Page 158 - Statistics and Data Analysis in Geology
Analysis of Multivariate Data
Discriminant Functions
One of the most widely used multivariate procedures in Earth science is the dis-
criminant function. We will consider it at length for two reasons: discrimination is
a powerful statistical tool and it can be regarded as either a way to treat univariate
problems related to multiple regression, or multivariate problems related to the
statistical tests we will discuss later. Discriminant functions therefore provide an
additional link between univariate and multivariate statistics.
First, however, we must define the process of discrimination, and carefully
distinguish it from the related process of classification. Suppose we have assembled
two collections of shale samples of known freshwater and saltwater origin. We
may have determined their origin from an examination of their fossil content. A
number of geochemical variables have been measured on each specimen, including
the content of vanadium, boron, iron, and so forth. The problem is to find the linear
combination of these variables that produces the maximum difference between the
two previously defined groups. If we find a function that produces a significant
difference, we can use it to allocate new specimens of shale of unknown origin to
one of the two original groups. In other words, new shale samples, not containing
diagnostic fossils, can then be categorized as marine or freshwater on the basis of
the linear discriminant function of their geochemical components. [This problem
was considered by Potter, Shimp, and Witters (1963).]
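The idea of finding a linear combination and then allocating new specimens can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the measurements are invented (they are not the shale data of Potter, Shimp, and Witters), only two variables are used so that the matrix inverse can be written out by hand, and the weights come from the standard two-group solution (inverse of the pooled covariance matrix times the difference between group means), which may differ in detail from the procedure developed in this text. The cutoff halfway between the two group means is a further assumption that treats both groups as equally likely. All function names are hypothetical.

```python
# Invented two-variable measurements for two groups of known origin.
# Each tuple is (variable_1, variable_2); the values are illustrative only.
group_a = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.0), (3.5, 4.5)]
group_b = [(5.0, 1.0), (6.0, 2.0), (5.5, 1.5), (6.5, 2.5)]


def mean_vector(rows):
    """Mean of each of the two variables."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter_matrix(rows, mean):
    """Sums of squares and cross products about the group mean (2 x 2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def score(weights, x):
    """Discriminant score: a linear combination of the measurements."""
    return sum(w * v for w, v in zip(weights, x))


def fit_discriminant(a, b):
    """Return (weights, cutoff) for a two-group linear discriminant."""
    ma, mb = mean_vector(a), mean_vector(b)
    sa, sb = scatter_matrix(a, ma), scatter_matrix(b, mb)
    dof = len(a) + len(b) - 2
    # Pooled within-group covariance matrix.
    p = [[(sa[i][j] + sb[i][j]) / dof for j in range(2)] for i in range(2)]
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    # Explicit inverse of a 2 x 2 matrix.
    inv = [[p[1][1] / det, -p[0][1] / det],
           [-p[1][0] / det, p[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    weights = [inv[i][0] * diff[0] + inv[i][1] * diff[1] for i in range(2)]
    # Assumed cutoff: the score of the point midway between the group means.
    midpoint = [(ma[j] + mb[j]) / 2 for j in range(2)]
    return weights, score(weights, midpoint)


def allocate(weights, cutoff, x):
    """Assign a new specimen to group 'A' or 'B' by its score."""
    return "A" if score(weights, x) > cutoff else "B"


weights, cutoff = fit_discriminant(group_a, group_b)
print(allocate(weights, cutoff, (3.0, 4.0)))
```

With more than two variables the same steps apply, but the hand-written inverse would be replaced by a general linear solver.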
Classification can be illustrated with a similar example. Suppose we have ob-
tained a large, heterogeneous collection of shale specimens, each of which has been
geochemically analyzed. On the basis of the measured variables, can the shales be
separated into groups (or clusters, as they are commonly called) that are both rel-
atively homogeneous and distinct from other groups? The process by which this
can be done has been highly developed by numerical taxonomists, and will be con-
sidered in a later section. There are several obvious differences between these pro-
cedures and those of discriminant function analysis. A classification is internally
based; that is, it does not depend on a priori knowledge about relations between
observations as does a discriminant function. The number of groups in a discrim-
inant function is set prior to the analysis, while in contrast the number of clusters
that will emerge from a classification scheme cannot ordinarily be predetermined.
Similarly, each original observation is defined as belonging to a specific group in
a discriminant analysis. In most classification procedures, an observation is free
to enter any cluster that emerges. Other differences will become apparent as we
examine these two procedures. The result of a cluster analysis of shales would be
a classification of the observations into several groups. It would then be up to us
to interpret the geological meaning (if any) of the groups so found.
A simple linear discriminant function transforms an original set of measure-
ments on a specimen into a single discriminant score. That score, or transformed
variable, represents the specimen’s position along a line defined by the linear dis-
criminant function. We can therefore think of the discriminant function as a way
of collapsing a multivariate problem down into a problem which involves only one
variable.
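Written out, the transformation is a weighted sum of the measurements. The symbols here are assumptions chosen for illustration and are not necessarily the notation used elsewhere in this text: x_1, ..., x_m are the m measurements on one specimen, \lambda_1, ..., \lambda_m are the coefficients of the discriminant function, and R is the resulting discriminant score.

```latex
R = \lambda_1 x_1 + \lambda_2 x_2 + \cdots + \lambda_m x_m
  = \sum_{j=1}^{m} \lambda_j x_j
```

Each specimen, however many variables were measured on it, is thereby reduced to the single number R.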
Discriminant function analysis consists of finding a transform which gives the
maximum ratio of the difference between two group multivariate means to the
multivariate variance within the two groups. If we regard our two groups as form-
ing clusters of points in multivariate space, we must search for the one orienta-
tion along which the two clusters have the greatest separation while each cluster