then treated as a new cluster center, and samples are reassigned cluster labels
based on the newly computed nearest cluster centers. This requires computing
the distances of all samples in the dataset from all cluster centers. The
objective of the K-means algorithm is to minimize the sum of the squared
Euclidean distances of samples from their corresponding cluster centers.
Iterative refinement of the cluster centers and the cluster labels of the
samples is then performed to optimize the positions of the cluster centers
until the algorithm reaches one of the following stopping criteria (a code
sketch follows the list):
1. Convergence of the cluster centers, such that cluster centers negligibly
change over several iterations.
2. Number of iterations reaches a maximum value specified by the user.
3. Variance of each cluster changes negligibly over several iterations.
4. The average squared Euclidean distance of samples from their cluster centers
reaches a local minimum.
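
The following is a minimal NumPy sketch of this iterative refinement (Lloyd's algorithm), not the authors' implementation; the function name kmeans and the parameters max_iter and tol are illustrative and correspond to stopping criteria 2 and 1, respectively:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Iteratively refine cluster centers and cluster labels (Lloyd's algorithm).

    Stops when the cluster centers change negligibly between iterations
    (criterion 1) or after max_iter iterations (criterion 2).
    """
    rng = np.random.default_rng(seed)
    # Initialize cluster centers by picking k distinct samples at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Distances of all samples from all cluster centers, shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)  # reassign samples to nearest centers
        # Each new center is the mean of the samples assigned to that cluster.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centers - centers) < tol:  # negligible change
            centers = new_centers
            break
        centers = new_centers
    # Final assignment and inertia: sum of squared distances to cluster centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = (dists[np.arange(len(X)), labels] ** 2).sum()
    return labels, centers, inertia
```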
For each iteration of the algorithm, new clusters are generated based only on the
old clusters. When there are n samples to be grouped into k clusters, there are kⁿ
possible assignments of samples to clusters. The quality of the final clusters
identified using K-means is quantified using the inertia or the silhouette score.
A few limitations of the K-means algorithm include the following:
1. Only suitable for globular, isotropic, well-separated, and equally sized
clusters.
2. The number of clusters needs to be predefined by the user. Generally, the elbow
method is used to identify the optimum number of cluster centers, but it can be a
computationally expensive process (see the sketch after this list).
3. A single application of K-means generates nonunique clusters and cluster
centers, because the final clustering depends on the random initialization of the
cluster centers.
4. Being a distance-based method, K-means needs feature scaling and
dimensionality reduction.
5. Not suitable for a high-dimensional dataset, where each sample has a large
number of features/attributes, because as the number of dimensions/features
increases, samples tend to become equidistant from each other and from the
cluster centers. In other words, for a higher-dimensional dataset, distance
becomes a weak metric for quantifying the similarity between samples.
6. Computationally expensive for a large dataset because each adjustment of the
cluster centers requires computing the distances of all samples in the dataset
from all cluster centers.
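
As a sketch of the elbow method and the silhouette score mentioned above, the following uses scikit-learn; the synthetic make_blobs dataset is an assumption standing in for a real subsurface dataset, and the range of candidate cluster counts is illustrative:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic globular clusters stand in for a real log/core dataset (assumption).
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)  # feature scaling (limitation 4)

# Elbow method: track inertia vs. number of clusters; silhouette as a second check.
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
```

Inertia decreases monotonically as k increases, so the optimum is read off at the "elbow" where the decrease flattens; the silhouette score, which is higher for compact and well-separated clusters, provides an independent check.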
4.2 Agglomerative clustering
Hierarchical clustering can be broadly categorized into agglomerative clustering
(bottom-up) and divisive clustering (top-down). In our study, we use agglomerative