Page 177 - Statistics and Data Analysis in Geology

Statistics and Data Analysis in Geology - Chapter 6

             measurement units. As an extreme example, we might measure three perpendicular
             axes on a collection of pebbles. If we measure two of the axes in centimeters and the
             third in millimeters, the third axis will have proportionally ten times more influence
             on the distance coefficient than either of  the other two variables.
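                 The effect of mixed units can be seen in a small sketch. The pebble
             dimensions below are invented for illustration, and ordinary Euclidean
             distance is assumed as the distance coefficient.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length measurement vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def third_axis_to_mm(pebble_cm):
    """Re-express only the third axis in millimeters (1 cm = 10 mm)."""
    return (pebble_cm[0], pebble_cm[1], pebble_cm[2] * 10.0)


# Hypothetical pebbles: three perpendicular axes, all in centimeters.
pebble_1_cm = (5.0, 3.0, 2.0)
pebble_2_cm = (4.0, 2.0, 1.5)

# Distance with consistent units (all three axes in centimeters).
d_consistent = euclidean_distance(pebble_1_cm, pebble_2_cm)

# Distance when the third axis alone is recorded in millimeters.
d_mixed = euclidean_distance(
    third_axis_to_mm(pebble_1_cm), third_axis_to_mm(pebble_2_cm)
)

print(f"all axes in cm:      {d_consistent:.3f}")
print(f"third axis in mm:    {d_mixed:.3f}")
```

             With these invented values the third axis accounts for about 11 percent of
             the squared distance when all axes are in centimeters, but about 93 percent
             when it alone is expressed in millimeters.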
                 Other measures  of  similarity that  are less commonly used in the Earth sci-
             ences include a wide variety of  association coefficients which are based on binary
             (presence-absence) variables or a combination of binary and continuous variables.
             The most popular of these are the simple matching coefficient, Jaccard’s coefficient,
             and Gower’s coefficient, all ratios of the presence-absence of properties.
             They differ primarily in the way that mutual absences (called “negative matches”)
             are considered.  Sneath and Sokal (1973) discuss the relative merits of  these and
             other coefficients of association. Probabilistic similarity coefficients are used with
             binary data and consider the gain or loss of information when objects are combined
             into clusters. Again, Sneath and Sokal (1973) provide a comprehensive summary.
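                 As a minimal sketch of how negative matches enter two of these
             coefficients, the code below uses the standard definitions of the simple
             matching and Jaccard coefficients. The presence-absence vectors are
             hypothetical, and Gower's coefficient is not shown.

```python
def match_counts(x, y):
    """Count the four possible outcomes for two binary (0/1) vectors.

    a = positive matches (1, 1)    b = present only in x (1, 0)
    c = present only in y (0, 1)   d = negative matches (0, 0)
    """
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Simple matching coefficient: negative matches count as agreement."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard(x, y):
    """Jaccard's coefficient: negative matches are ignored entirely."""
    a, b, c, _ = match_counts(x, y)
    if a + b + c == 0:
        raise ValueError("Jaccard's coefficient is undefined when both vectors are all zeros")
    return a / (a + b + c)


# Hypothetical presence (1) or absence (0) of eight properties in two objects.
object_x = [1, 1, 0, 0, 1, 0, 0, 0]
object_y = [1, 0, 0, 0, 1, 1, 0, 0]

print(match_counts(object_x, object_y))
print(simple_matching(object_x, object_y))
print(jaccard(object_x, object_y))
```

             The two objects share four mutual absences, which raise the simple
             matching coefficient but have no effect on Jaccard's coefficient.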
                 Computation of  a similarity measurement between all possible pairs of objects
             will result in an n x n symmetrical matrix, C. Any coefficient Cij in the matrix gives
             the resemblance between objects i and j. The next step is to arrange the objects
             into a hierarchy so objects with the highest mutual similarity are placed together.
             Then groups or clusters of  objects are associated with other groups which they
             most closely resemble, and so on until all of  the objects have been placed into a
             complete classification scheme. Many variants of clustering have been developed; a
             consideration of  all of  the possible alternative procedures and their relative merits
             is beyond the scope of  this book.  Rather, we will discuss one simple clustering
             technique called the weighted pair-group method with arithmetic averaging, and
             then point out some useful modifications to this scheme.
                 Extensive discussions of  hierarchical and other classification techniques  are
             contained in books by Jardine and Sibson (1971), Sneath and Sokal (1973), Hartigan
             (1975), Aldenderfer and Blashfield (1984), Romesburg (1984), Kaufman and
             Rousseeuw (1990), Backer (1995), and Gordon (1999). Diskettes containing clustering
             programs are included in some of these books or are available separately at
             modest cost. In addition, most personal computer programs for statistical analysis
             contain modules for hierarchical clustering.
                 Table 6-8  contains measurements made on six greywacke thin sections, iden-
             tified as A, B, . . . , F.  The values represent the average of  the apparent maximum
             diameters of  ten randomly chosen grains of  quartz, rock fragment, and feldspar
             and the average of  the apparent maximum diameters of  ten intergranular pores in
             each thin section.  The table also gives a symmetric matrix of  similarities, in the
             form of  “correlation” coefficients calculated between the six thin sections.
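                 A similarity matrix of this kind might be computed as sketched below,
             assuming the ordinary product-moment correlation taken across the four
             variables of each pair of objects. The measurements and the labels S1, S2,
             and S3 are hypothetical and are not the values in Table 6-8.

```python
import math


def pearson(u, v):
    """Product-moment correlation between two equal-length vectors."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cross = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    ss_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cross / (ss_u * ss_v)


def similarity_matrix(objects):
    """Return a nested dict C where C[i][j] is the correlation of objects i and j."""
    names = list(objects)
    return {i: {j: pearson(objects[i], objects[j]) for j in names} for i in names}


# Hypothetical average diameters for (quartz, rock fragments, feldspar, pores).
sections = {
    "S1": [0.30, 0.45, 0.20, 0.10],
    "S2": [0.60, 0.90, 0.40, 0.20],  # exactly twice S1, so it correlates perfectly with S1
    "S3": [0.10, 0.20, 0.35, 0.40],
}

C = similarity_matrix(sections)
for name, row in C.items():
    print(name, [round(value, 3) for value in row.values()])
```

             Because correlation is insensitive to a uniform change of scale, S1 and
             S2 are judged identical in this sketch even though every grain in S2 is
             twice as large.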
                 The first  step in clustering by  a pair-group  method is to find the mutually
             highest  correlations in the matrix to  form the centers of  clusters.  The highest
             correlation (disregarding the diagonal element) in each column of  the matrix in
             Table 6-8  is shown in boldface type. Specimens A and B form mutually high pairs,
             because A most closely resembles B, and B most closely resembles A. C and D also
             form mutually high pairs.  E most closely resembles D, but these two do not form
             a mutually high pair because D  resembles C more than it does E.  To qualify as a
             mutually high pair, coefficients Cij and Cji must be the highest coefficients in their
             respective columns.
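                 The search for mutually high pairs can be sketched as follows. The
             function name and the similarity values are hypothetical; the values are
             chosen only to reproduce the pattern described above and are not taken
             from Table 6-8.

```python
def mutually_high_pairs(labels, C):
    """Return pairs (i, j) where C[i][j] is the largest off-diagonal entry
    in both column i and column j of the symmetric matrix C."""
    n = len(labels)
    best = []
    for j in range(n):
        # Row index of the highest coefficient in column j, ignoring the diagonal.
        # Ties are resolved in favor of the first row encountered.
        candidates = [i for i in range(n) if i != j]
        best.append(max(candidates, key=lambda i: C[i][j]))
    return [
        (labels[i], labels[best[i]])
        for i in range(n)
        if best[best[i]] == i and i < best[i]
    ]


labels = ["A", "B", "C", "D", "E", "F"]

# Hypothetical upper triangle of a similarity matrix.
upper = {
    ("A", "B"): 0.95, ("A", "C"): 0.20, ("A", "D"): 0.25, ("A", "E"): 0.10,
    ("A", "F"): 0.30, ("B", "C"): 0.15, ("B", "D"): 0.22, ("B", "E"): 0.12,
    ("B", "F"): 0.28, ("C", "D"): 0.99, ("C", "E"): 0.80, ("C", "F"): 0.40,
    ("D", "E"): 0.85, ("D", "F"): 0.45, ("E", "F"): 0.60,
}

# Fill the full symmetric matrix, with ones on the diagonal.
n = len(labels)
C = [[1.0] * n for _ in range(n)]
for (p, q), value in upper.items():
    i, j = labels.index(p), labels.index(q)
    C[i][j] = C[j][i] = value

pairs = mutually_high_pairs(labels, C)
print(pairs)
```

             Here E most closely resembles D, but D most closely resembles C, so only
             the pairs A-B and C-D qualify.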
                 We can indicate the resemblance between our mutually high pairs in a diagram
             such as Figure 6-5a. Object C is connected to D at a level of r = 0.99, indicating
