Page 175 - Statistics and Data Analysis in Geology

Statistics and Data Analysis in Geology - Chapter 6

             their  characteristics and similarities.  Taxonomy is highly subjective and depen-
             dent upon the individual taxonomist’s skills, developed through years of  experi-
             ence. In this respect, the field is analogous in many ways to geology. As in geology,
             researchers  dissatisfied with  the  subjectivity  and  capriciousness  of  traditional
             methods have sought new techniques of  classification which incorporate the mas-
             sive data-handling capabilities of  the computer.  These workers, responsible  for
             many of  the advances made in numerical classification, call themselves numerical
             taxonomists.
                 Numerical taxonomy has been a center of controversy in biology, much as
             factor analysis was in the 1930s and 1940s, when the suspicion that swirled
             around it provoked acrimonious debates among psychologists. As in that dispute, the techniques of
             numerical taxonomy were overzealously promoted by some practitioners.  In ad-
             dition, it was claimed that a numerically derived taxonomy better represented the
             phylogeny of  a group of  organisms than could any other type of  classification. Al-
             though this has yet to be demonstrated, rapid progress in genotyping suggests that
             an objective phylogeny may someday be possible.  The conceptual underpinnings
             of  taxonomic methods such as cluster analysis are incomplete; the various cluster-
             ing methods lie outside the body of multivariate statistical theory, and only limited
             tests of  significance are available (Hartigan, 1975; Milligan and Cooper, 1986; Bock,
             1996). Although cluster analysis has become an accepted tool for researchers and
             there are an increasing number of  books on the subject, a more complete statis-
             tical basis for classification has yet to be fashioned.  In spite of  this, many of  the
             methods of  numerical taxonomy are important in geologic research, especially in
             the classification of  fossil invertebrates and the study of paleoenvironments.
                 The purpose of  cluster analysis is to assemble observations into relatively ho-
             mogeneous groups or “clusters,” the members of  which are at once alike and at
             the same time unlike members of  other groups. There is no analytical solution to
             this problem, which is common to all areas of classification, not just numerical tax-
             onomy.  Although there are alternative classifications of  classification procedures
             (Sneath and Sokal, 1973; Gordon, 1999), most may be grouped into four general
             types.
               1. Partitioning methods operate on the multivariate observations themselves, or
                  on projections of these observations onto planes of lower dimension. Basically,
                  these methods cluster by finding regions in the space defined by the m
                  variables that are poorly populated with observations, and that separate densely
                  populated regions. Mathematical “partitions” are placed in the sparse regions,
                  subdividing the variable space into discrete classes. Although the analysis
                  is done in the m-dimensional space defined by the variables rather than the
                  n-dimensional space defined by the observations, it proceeds iteratively and
                  may be extremely time-consuming (Aldenderfer and Blashfield, 1984; Gordon,
                  1999).
               2. Arbitrary origin methods operate on the similarity between the observations
                  and a set of arbitrary starting points. If n observations are to be classified
                  into k groups, it is necessary to compute an asymmetric n × k matrix of
                  similarities between the n samples and the k arbitrary points that serve as initial
                  group centroids. The observation closest or most similar to a starting point is
                  combined with it to form a cluster. Observations are iteratively added to the
                  nearest cluster, whose centroid is then recalculated for the expanded cluster.
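
The arbitrary origin procedure described in item 2 can be sketched in Python. This is only a minimal illustration under stated assumptions, not the procedure of any particular published program: the function name `arbitrary_origin_clusters` is hypothetical, Euclidean distance stands in for the similarity measure, and each starting point is assumed to count as one member of the cluster it seeds.

```python
import math


def arbitrary_origin_clusters(observations, seeds):
    """Assign n observations to k clusters grown from arbitrary starting points.

    observations: sequence of n points, each a sequence of m numbers.
    seeds: sequence of k arbitrary starting points (initial group centroids).
    Returns (labels, centroids), where labels[i] is the cluster of observation i.
    """
    centroids = [list(s) for s in seeds]
    # Assumption: a starting point counts as one member of its cluster.
    weights = [1] * len(centroids)
    labels = [None] * len(observations)
    unassigned = set(range(len(observations)))

    while unassigned:
        # Scan the (remaining n) x k distances between observations and
        # current centroids, and pick the closest observation-centroid pair.
        # Ties are broken by the lowest observation index, then cluster index.
        _, i, j = min(
            (math.dist(observations[i], centroids[j]), i, j)
            for i in unassigned
            for j in range(len(centroids))
        )
        labels[i] = j
        unassigned.remove(i)

        # Recalculate the centroid of the expanded cluster as a running mean.
        w = weights[j]
        centroids[j] = [
            (c * w + x) / (w + 1) for c, x in zip(centroids[j], observations[i])
        ]
        weights[j] = w + 1

    return labels, centroids
```

Calling `arbitrary_origin_clusters(points, starts)` with n points and k starting points returns one cluster label per observation together with the final centroids. Because every step rescans all remaining distances, the sketch favors clarity over speed, and its result depends on the choice of starting points.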
