clustering, which starts by treating each sample as its own cluster and then groups similar objects (clusters or samples) into higher-level clusters, such that the newly formed higher-level clusters are distinct from each other and the objects within each cluster are similar. Hierarchical clustering builds a hierarchy of clusters (represented as a dendrogram) from low-level clusters to high-level clusters
based on a user-specified distance metric and linkage criteria. Linkage criteria
are used to define similarity between two clusters and decide which two
clusters to combine when generating new higher-hierarchy clusters. A few popular linkage criteria are single linkage, complete linkage, average linkage, centroid linkage, and Ward's linkage. Single linkage combines the two clusters whose closest samples are nearest to each other; that is, it merges based on the minimum pairwise distance between clusters. Single linkage is suitable for nonglobular clusters, tends to generate elongated clusters, and is sensitive to noise. Complete linkage combines the two clusters whose farthest samples are nearest to each other; that is, it merges based on the maximum pairwise distance between clusters. Complete linkage is not suitable for nonglobular clusters, tends to generate globular clusters, and is resistant to noise. Ward's linkage combines the two clusters whose merger results in the smallest increase in within-cluster variance. Ward's linkage generates dense clusters concentrated toward the center, with few, relatively scattered samples/points at the margins. Both the
distance metric and the linkage criteria need to be defined in accordance with the phenomena/processes that generated the dataset. For example, when clustering accident sites or regions of high vehicular traffic in a dense urban city, we should use Manhattan distance instead of Euclidean distance as the distance metric, because travel in a street grid follows axis-aligned paths. As another example, when the dataset is generated by a slowly changing process (e.g., the evolution of a civilization), single linkage is best suited to group similar clusters (e.g., of archeological objects).
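As an illustration of these choices, the following minimal sketch applies agglomerative clustering with different linkage criteria and distance metrics using scikit-learn; the synthetic dataset, the number of clusters, and the parameter values are illustrative assumptions (in recent scikit-learn versions, the distance metric is passed via the metric argument, formerly named affinity):

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Synthetic 2D dataset with three globular groups (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Ward's linkage (requires Euclidean distance) merges the two clusters
# whose merger causes the smallest increase in within-cluster variance.
ward = AgglomerativeClustering(n_clusters=3, linkage="ward")
ward_labels = ward.fit_predict(X)

# Single linkage with Manhattan (cityblock) distance, e.g., for data on
# an urban street grid; merges the clusters whose closest samples are
# nearest to each other.
single = AgglomerativeClustering(n_clusters=3, linkage="single",
                                 metric="manhattan")
single_labels = single.fit_predict(X)

# Complete linkage merges the clusters whose farthest samples are
# nearest to each other, favoring compact, globular clusters.
complete = AgglomerativeClustering(n_clusters=3, linkage="complete")
complete_labels = complete.fit_predict(X)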
A few advantages of agglomerative clustering are as follows:
1. It is suitable when the underlying data have structure, order, and
interdependencies (like the correlations in financial markets).
2. It generates a hierarchy that facilitates selection of the number of clusters in the dataset.
3. It allows generation of clusters at a user-defined granularity by searching through (cutting) the dendrogram, as the sketch after this list shows.
4. Unlike K-means, agglomerative (hierarchical) clustering does not require
the user to specify the number of clusters prior to applying the algorithm.
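As a sketch of how the dendrogram supports these advantages, the following code builds the hierarchy with SciPy and then extracts flat clusters at a user-chosen granularity, without specifying the number of clusters in advance; the synthetic dataset and the cut threshold of 10.0 are illustrative assumptions:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Build the full merge hierarchy with Ward's linkage.
Z = linkage(X, method="ward")

# Plot the dendrogram to visually inspect the hierarchy of clusters.
dendrogram(Z)
plt.show()

# Cut the dendrogram at a chosen distance threshold to obtain flat
# clusters at that granularity (threshold is an illustrative value).
labels = fcluster(Z, t=10.0, criterion="distance")

# Alternatively, request an exact number of clusters from the same tree.
labels_3 = fcluster(Z, t=3, criterion="maxclust")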
A few disadvantages of agglomerative clustering are as follows:
1. It is necessary to specify both the distance metric and the linkage criteria,
which are selected without any strong theoretical basis.
2. Compared with K-means, whose computational time is linear in the number of samples, hierarchical clustering techniques are at least quadratic, that is, computationally expensive and slow; the timing sketch after this list illustrates the difference.
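The following rough timing sketch illustrates the second disadvantage by clustering synthetic datasets of increasing size with both algorithms; the sample sizes and cluster counts are illustrative assumptions, and the measured times depend on hardware and implementation:

import time
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

for n in (1000, 2000, 4000):
    X, _ = make_blobs(n_samples=n, centers=5, random_state=0)

    start = time.perf_counter()
    KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
    kmeans_time = time.perf_counter() - start

    start = time.perf_counter()
    AgglomerativeClustering(n_clusters=5).fit(X)
    agglo_time = time.perf_counter() - start

    # K-means time grows roughly linearly with n, whereas agglomerative
    # clustering grows at least quadratically (pairwise distances).
    print(f"n={n}: K-means {kmeans_time:.3f}s, "
          f"agglomerative {agglo_time:.3f}s")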