Page 177 - Statistics and Data Analysis in Geology
Statistics and Data Analysis in Geology - Chapter 6
measurement units. As an extreme example, we might measure three perpendicular
axes on a collection of pebbles. If we measure two of the axes in centimeters and the
third in millimeters, the third axis will have proportionally ten times more influence
on the distance coefficient than either of the other two variables.
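The effect of mixed units can be checked numerically. The sketch below is illustrative only: the two pebbles and their axis lengths are invented for the example, and the distance coefficient is assumed to be the ordinary Euclidean distance between the two measurement vectors.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length measurement vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def third_axis_to_mm(pebble):
    """Re-express only the third axis in millimeters (1 cm = 10 mm)."""
    return (pebble[0], pebble[1], pebble[2] * 10.0)


# Hypothetical pebbles: three perpendicular axes, all in centimeters.
pebble_1 = (6.0, 4.0, 2.0)
pebble_2 = (5.0, 3.0, 1.0)

# All three axes in centimeters: each axis contributes equally here.
d_consistent = euclidean_distance(pebble_1, pebble_2)

# Third axis in millimeters: its difference is inflated tenfold.
mixed_1 = third_axis_to_mm(pebble_1)
mixed_2 = third_axis_to_mm(pebble_2)
d_mixed = euclidean_distance(mixed_1, mixed_2)

# Fraction of the squared distance that comes from the third axis.
share_third_axis = (mixed_1[2] - mixed_2[2]) ** 2 / d_mixed ** 2

print(d_consistent, d_mixed, share_third_axis)
```

In this invented case each axis differs by 1 cm, so the three axes contribute equally when all are in centimeters. With the third axis in millimeters its difference becomes 10, and it accounts for 100 of the 102 units of squared distance. Converting all variables to a common unit, or standardizing them, before computing distances removes this artifact.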
Other measures of similarity that are less commonly used in the Earth sci-
ences include a wide variety of association coefficients which are based on binary
(presence-absence) variables or a combination of binary and continuous variables.
The most popular of these are the simple matching coefficient, Jaccard’s coefficient,
and Gower’s coefficient, all ratios of the presence-absence of properties.
They differ primarily in the way that mutual absences (called “negative matches”)
are considered. Sneath and Sokal (1973) discuss the relative merits of these and
other coefficients of association. Probabilistic similarity coefficients are used with
binary data and consider the gain or loss of information when objects are combined
into clusters. Again, Sneath and Sokal (1973) provide a comprehensive summary.
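The different treatment of negative matches can be seen in a small sketch of two of these coefficients. It uses the standard definitions, with a, b, c, and d as the usual counts of joint presences, the two kinds of mismatch, and joint absences. The function names and the two presence-absence vectors are hypothetical, chosen only for illustration. Gower's coefficient, which also handles continuous variables, is not shown.

```python
def match_counts(x, y):
    """Count the four outcomes for two binary (0/1) vectors.

    a: present in both     b: present in x only
    c: present in y only   d: absent in both (negative match)
    """
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d


def simple_matching(x, y):
    """Simple matching coefficient: negative matches count as agreement."""
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + b + c + d)


def jaccard(x, y):
    """Jaccard's coefficient: negative matches are ignored."""
    a, b, c, _ = match_counts(x, y)
    if a + b + c == 0:
        raise ValueError("Jaccard's coefficient is undefined when no property is present")
    return a / (a + b + c)


# Hypothetical presence-absence data for eight properties in two samples.
sample_1 = [1, 1, 0, 0, 1, 0, 0, 0]
sample_2 = [1, 0, 0, 0, 1, 1, 0, 0]

print(simple_matching(sample_1, sample_2))  # 0.75
print(jaccard(sample_1, sample_2))          # 0.5
```

The four shared absences raise the simple matching coefficient to 6/8, whereas Jaccard's coefficient uses only the four properties present in at least one sample and gives 2/4.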
Computation of a similarity measurement between all possible pairs of objects
will result in an n × n symmetric matrix, C. Any coefficient Cij in the matrix gives
the resemblance between objects i and j. The next step is to arrange the objects
into a hierarchy so objects with the highest mutual similarity are placed together.
Then groups or clusters of objects are associated with other groups which they
most closely resemble, and so on until all of the objects have been placed into a
complete classification scheme. Many variants of clustering have been developed; a
consideration of all of the possible alternative procedures and their relative merits
is beyond the scope of this book. Rather, we will discuss one simple clustering
technique called the weighted pair-group method with arithmetic averaging, and
then point out some useful modifications to this scheme.
Extensive discussions of hierarchical and other classification techniques are
contained in books by Jardine and Sibson (1971), Sneath and Sokal (1973), Hartigan
(1975), Aldenderfer and Blashfield (1984), Romesburg (1984), Kaufman and
Rousseeuw (1990), Backer (1995), and Gordon (1999). Diskettes containing clustering
programs are included in some of these books or are available separately at
modest cost. In addition, most personal computer programs for statistical analysis
contain modules for hierarchical clustering.
Table 6-8 contains measurements made on six greywacke thin sections, iden-
tified as A, B, . . . , F. The values represent the average of the apparent maximum
diameters of ten randomly chosen grains of quartz, rock fragment, and feldspar
and the average of the apparent maximum diameters of ten intergranular pores in
each thin section. The table also gives a symmetric matrix of similarities, in the
form of “correlation” coefficients calculated between the six thin sections.
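A matrix of this kind can be sketched as follows, assuming the "correlation" is the ordinary Pearson coefficient computed between pairs of thin sections across their four measured variables. The numbers below are invented placeholders, not the values of Table 6-8, and only three sections are shown to keep the example short.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    ss_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    # Undefined (division by zero) if either sequence is constant.
    return cov / (ss_u * ss_v)


def similarity_matrix(objects):
    """Return object names and the n x n matrix of pairwise correlations."""
    names = list(objects)
    matrix = [[pearson(objects[i], objects[j]) for j in names] for i in names]
    return names, matrix


# Hypothetical measurements (quartz, rock fragment, feldspar, pore),
# NOT the data of Table 6-8.
sections = {
    "A": (0.30, 0.45, 0.25, 0.10),
    "B": (0.32, 0.47, 0.26, 0.12),
    "C": (0.50, 0.20, 0.40, 0.05),
}

names, C = similarity_matrix(sections)
for name, row in zip(names, C):
    print(name, [round(value, 3) for value in row])
```

Each object is treated here as a vector of four variables, so the correlation measures similarity in the pattern of values across variables rather than between the variables themselves. The matrix is symmetric with ones on the diagonal.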
The first step in clustering by a pair-group method is to find the mutually
highest correlations in the matrix to form the centers of clusters. The highest
correlation (disregarding the diagonal element) in each column of the matrix in
Table 6-8 is shown in boldface type. Specimens A and B form a mutually high pair,
because A most closely resembles B, and B most closely resembles A. C and D also
form a mutually high pair. E most closely resembles D, but these two do not form
a mutually high pair because D resembles C more than it does E. To qualify as a
mutually high pair, coefficients Cij and Cji must be the highest coefficients in their
respective columns.
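The search for mutually high pairs can be sketched as below. The function name and the matrix are hypothetical: the matrix is not Table 6-8, but is constructed so that it reproduces the relationships described above (A with B, C with D, and E closest to D). The behavior of F is an arbitrary assumption.

```python
def mutually_high_pairs(names, sim):
    """Find pairs (i, j) where each is the other's highest off-diagonal entry.

    `sim` is a symmetric similarity matrix given as a list of rows.
    Ties are broken in favor of the first (lowest-index) row.
    """
    n = len(names)
    # best[j] is the row index of the highest off-diagonal value in column j.
    best = []
    for j in range(n):
        candidates = [i for i in range(n) if i != j]
        best.append(max(candidates, key=lambda i: sim[i][j]))
    pairs = []
    for i in range(n):
        j = best[i]
        # Keep the pair only if the choice is reciprocated; i < j avoids duplicates.
        if i < j and best[j] == i:
            pairs.append((names[i], names[j]))
    return pairs


# Hypothetical similarity matrix, NOT the values of Table 6-8.
names = ["A", "B", "C", "D", "E", "F"]
sim = [
    [1.00, 0.95, 0.20, 0.25, 0.30, 0.10],
    [0.95, 1.00, 0.22, 0.28, 0.35, 0.15],
    [0.20, 0.22, 1.00, 0.99, 0.80, 0.60],
    [0.25, 0.28, 0.99, 1.00, 0.90, 0.65],
    [0.30, 0.35, 0.80, 0.90, 1.00, 0.70],
    [0.10, 0.15, 0.60, 0.65, 0.70, 1.00],
]

print(mutually_high_pairs(names, sim))  # [('A', 'B'), ('C', 'D')]
```

In this invented matrix E's highest value is with D (0.90), but D's highest is with C (0.99), so D and E are not paired at this step.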
We can indicate the resemblance between our mutually high pairs in a diagram
such as Figure 6-5a. Object C is connected to D at a level of r = 0.99, indicating