Page 175 - Statistics and Data Analysis in Geology
Statistics and Data Analysis in Geology - Chapter 6
their characteristics and similarities. Taxonomy is highly subjective and depen-
dent upon the individual taxonomist’s skills, developed through years of experi-
ence. In this respect, the field is analogous in many ways to geology. As in geology,
researchers dissatisfied with the subjectivity and capriciousness of traditional
methods have sought new techniques of classification which incorporate the mas-
sive data-handling capabilities of the computer. These workers, responsible for
many of the advances made in numerical classification, call themselves numerical
taxonomists.
Numerical taxonomy has been a center of controversy in biology, much as factor
analysis was in the 1930s and 1940s, when suspicion of the technique provoked
acrimonious debates among psychologists. As in that dispute, the techniques of
numerical taxonomy were overzealously promoted by some practitioners. In ad-
dition, it was claimed that a numerically derived taxonomy better represented the
phylogeny of a group of organisms than could any other type of classification. Al-
though this has yet to be demonstrated, rapid progress in genotyping suggests that
an objective phylogeny may someday be possible. The conceptual underpinnings
of taxonomic methods such as cluster analysis are incomplete; the various cluster-
ing methods lie outside the body of multivariate statistical theory, and only limited
tests of significance are available (Hartigan, 1975; Milligan and Cooper, 1986; Bock,
1996). Although cluster analysis has become an accepted tool for researchers and
an increasing number of books address the subject, a more complete statistical
basis for classification has yet to be fashioned. In spite of this, many of the
methods of numerical taxonomy are important in geologic research, especially in
the classification of fossil invertebrates and the study of paleoenvironments.
The purpose of cluster analysis is to assemble observations into relatively ho-
mogeneous groups or “clusters,” the members of which are at once alike and at
the same time unlike members of other groups. There is no analytical solution to
this problem, which is common to all areas of classification, not just numerical tax-
onomy. Although there are alternative classifications of classification procedures
(Sneath and Sokal, 1973; Gordon, 1999), most may be grouped into four general
types.
1. Partitioning methods operate on the multivariate observations themselves, or
on projections of these observations onto planes of lower dimension. Basically,
these methods cluster by finding regions in the space defined by the m vari-
ables that are poorly populated with observations, and that separate densely
populated regions. Mathematical “partitions” are placed in the sparse regions,
subdividing the variable space into discrete classes. Although the analysis
is done in the m-dimensional space defined by the variables rather than the
n-dimensional space defined by the observations, it proceeds iteratively and
may be extremely time-consuming (Aldenderfer and Blashfield, 1984; Gordon,
1999).
2. Arbitrary origin methods operate on the similarity between the observations
and a set of arbitrary starting points. If n observations are to be classified
into k groups, it is necessary to compute an asymmetric n × k matrix of similarities
between the n observations and the k arbitrary points that serve as initial
group centroids. The observation closest or most similar to a starting point is
combined with it to form a cluster. Observations are iteratively added to the
nearest cluster, whose centroid is then recalculated for the expanded cluster.
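
The arbitrary origin procedure described in item 2 can be sketched in Python as
follows. This is only an illustration under stated assumptions, not the method of
any particular program: Euclidean distance stands in for the similarity measure, a
cluster centroid is taken to be the mean of the observations assigned to it (the
arbitrary starting point is dropped once the first observation joins), and the
function names distance_matrix and arbitrary_origin_clusters are hypothetical.

```python
import math


def distance_matrix(observations, centroids):
    """Return the n x k matrix of Euclidean distances between the
    n observations and the k current group centroids."""
    return [[math.dist(obs, c) for c in centroids] for obs in observations]


def arbitrary_origin_clusters(observations, seeds):
    """Assign every observation to one of k clusters grown from seed points.

    Assumption: distance replaces similarity, so "most similar" means
    "smallest distance".
    """
    centroids = [list(s) for s in seeds]      # k arbitrary starting points
    members = [[] for _ in seeds]             # observation indices per cluster
    unassigned = set(range(len(observations)))

    while unassigned:
        remaining = sorted(unassigned)
        dists = distance_matrix([observations[i] for i in remaining], centroids)

        # Find the unassigned observation that lies closest to any centroid.
        # Ties are broken by observation index, then by cluster index.
        _, i, j = min(
            (dists[r][j], i, j)
            for r, i in enumerate(remaining)
            for j in range(len(centroids))
        )

        # Add it to that cluster and recalculate the cluster's centroid
        # as the mean of its members (assumption: the seed is not averaged in).
        members[j].append(i)
        unassigned.remove(i)
        pts = [observations[m] for m in members[j]]
        centroids[j] = [sum(col) / len(pts) for col in zip(*pts)]

    return members, centroids


# Six observations on m = 2 variables, classified into k = 2 groups.
observations = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
seeds = [(1, 1), (9, 9)]
members, centroids = arbitrary_origin_clusters(observations, seeds)
print(members)
print(centroids)
```

In this small example the first three observations gather around the first
starting point and the last three around the second. Because each centroid moves
as observations are added, the result can depend on where the arbitrary starting
points are placed.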