Page 176 - Statistics and Data Analysis in Geology

Analysis of Multivariate Data

               3. Mutual similarity procedures group together observations that have a common
                  similarity to other observations. First an n x n matrix of similarities between
                  all pairs of observations is calculated. Then the similarity between columns
                  of this matrix is iteratively recomputed. Columns representing members of a
                  single cluster will tend to have intercorrelations near +1, while having much
                  lower correlations with nonmembers.
               4. Hierarchical clustering joins the most similar observations, then successively
                  connects the next most similar observations to these. First an n x n matrix of
                  similarities between all pairs of observations is calculated. Those pairs having
                  the highest similarities are then merged, and the matrix is recomputed. This
                  is done by averaging the similarities that the combined observations have with
                  other observations. The process iterates until the similarity matrix is reduced
                  to 2 x 2. The progression of levels of similarity at which observations merge is
                  displayed as a dendrogram.
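The merge-and-average procedure of step 4 can be sketched in a few lines of NumPy. This is a minimal illustration, not a production routine; the four example objects and their coordinates are hypothetical. Distances between merged clusters are recomputed as size-weighted averages, which matches the averaging rule described above (the UPGMA, or average-linkage, rule).

```python
import numpy as np

def average_linkage(dist):
    """Agglomerative clustering of a square distance matrix.

    Repeatedly merges the closest pair of clusters, recomputing the merged
    cluster's distances as the size-weighted average of the two old rows.
    Returns a list of merges as (members_a, members_b, merge_distance).
    """
    d = dist.astype(float).copy()
    np.fill_diagonal(d, np.inf)          # never merge a cluster with itself
    clusters = [[i] for i in range(len(d))]  # every object starts alone
    merges = []
    while len(clusters) > 1:
        # locate the most similar (closest) pair of clusters
        i, j = np.unravel_index(np.argmin(d), d.shape)
        if i > j:
            i, j = j, i
        merges.append((clusters[i], clusters[j], float(d[i, j])))
        # average the distances the two clusters have with all others,
        # weighting by cluster size
        ni, nj = len(clusters[i]), len(clusters[j])
        new_row = (ni * d[i] + nj * d[j]) / (ni + nj)
        d[i, :] = new_row
        d[:, i] = new_row
        d[i, i] = np.inf
        d = np.delete(np.delete(d, j, axis=0), j, axis=1)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Hypothetical example: four objects measured on two variables, forming
# two obvious pairs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
merges = average_linkage(D)
```

The sequence of merge distances in `merges` is exactly the information a dendrogram displays: which clusters joined, and at what level of (dis)similarity.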
                  Hierarchical clustering techniques are most widely applied in the Earth sci-
              ences, probably because their development has been closely linked with the numer-
              ical taxonomy of fossil organisms. Because of the widespread use of hierarchical
              techniques, we will consider them in some detail.
                  Suppose we have a collection of objects we wish to arrange into a hierarchical
              classification. In biology, these objects are referred to as “operational taxonomic
              units” or OTUs (Sneath and Sokal, 1973). We can make a series of measurements
              on each object which constitutes our data set. If we have n objects and measure m
              characteristics, the observations form an n x m data matrix, X. Next, some measure
              of resemblance or similarity must be computed between every pair of objects; that
              is, between the rows of the data matrix. Several coefficients of resemblance have
              been used, including a variation of the correlation coefficient rij in which the roles
              of objects and variables are interchanged. This can be done by transposing X so
              rows become columns and vice versa, then calculating rij in the conventional man-
              ner (Eq. 2.28; p. 43), following the matrix algorithm given in Chapter 3. Although
              called “correlation,” this measure is not really a correlation coefficient in the con-
              ventional sense because it involves “means” and “variances” calculated across all
              the variables measured on two objects, rather than the means and variances of two
              variables.
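Computing this object-to-object “correlation” amounts to applying the ordinary correlation formula to the transposed data matrix, so that means and variances run across the m variables of each object. A brief sketch, with hypothetical data values:

```python
import numpy as np

def object_correlation(X):
    """Correlations between the ROWS (objects) of the n x m data matrix X.

    np.corrcoef treats each row as a variable, which is precisely the
    role reversal described in the text: objects play the part of
    variables, and the m measurements play the part of observations.
    """
    return np.corrcoef(X)

# Hypothetical data: object 1 is a scaled copy of object 0, so their
# "correlation" similarity is +1; object 2 trends in the opposite direction.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 1.0, 0.0]])
R = object_correlation(X)
```

Note that `R` here is an n x n similarity matrix between objects, not the usual m x m correlation matrix between variables.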
                 Another commonly used measure of  similarity between objects is a standard-
             ized m-space Euclidean distance, dij. The distance coefficient is computed by


\[
    d_{ij} = \sqrt{\frac{\sum_{k=1}^{m} \left( x_{ik} - x_{jk} \right)^2}{m}}
\tag{6.40}
\]


              where xik denotes the kth variable measured on object i and xjk is the kth variable
              measured on object j. In all, m variables are measured on each object, and dij is
              the distance between object i and object j. As you would expect, a small distance
              indicates the two objects are similar or “close together,” whereas a large distance
              indicates dissimilarity. Commonly, each element in the n x m raw data matrix
              X is standardized by subtracting the column means and dividing by the column
              standard deviations prior to computing distance measurements. This ensures that
              each variable is weighted equally. Otherwise, the distance will be influenced most
              strongly by the variable which has the greatest magnitude. In some instances this
              may be desirable, but unwanted effects can creep in through injudicious choice of
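The standardized distance of Equation (6.40) can be sketched as follows (a minimal NumPy version; the example matrix is hypothetical). Standardizing column-wise before computing distances is what makes the result insensitive to the units or magnitudes of the individual variables:

```python
import numpy as np

def standardized_distance(X):
    """Matrix of d_ij (Eq. 6.40) between the rows of the n x m matrix X.

    Each column is standardized (column mean subtracted, divided by the
    column standard deviation), then squared differences are averaged
    over the m variables and the square root taken.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    diff = Z[:, None, :] - Z[None, :, :]   # all pairwise differences
    m = X.shape[1]
    return np.sqrt((diff ** 2).sum(axis=2) / m)

# Hypothetical data: four objects, two variables.
Xd = np.array([[0.0, 0.0],
               [1.0, 2.0],
               [2.0, 1.0],
               [3.0, 3.0]])
Dd = standardized_distance(Xd)
```

Because of the standardization, rescaling any one column (say, converting a variable from metres to millimetres) leaves the distance matrix unchanged, which is the "equal weighting" property described above.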
