Page 59 - Machine Learning for Subsurface Characterization

Characterization of fracture-induced geomechanical alterations (Chapter 2)


             then treated as a new cluster center, and samples are reassigned cluster
             labels based on the newly computed nearest cluster centers. This requires
             computing the distances of every sample in the dataset from every cluster
             center. The objective of the K-means algorithm is to minimize the average
             sum of the squared Euclidean distances of the samples from their
             corresponding cluster centers. Iterative refinement of the cluster centers
             and of the cluster labels of the samples is then performed to optimize the
             positions of the cluster centers until the algorithm reaches one of the
             following stopping criteria:

             1. Convergence of the cluster centers, such that cluster centers negligibly
                change over several iterations.
             2. Number of iterations reaches a maximum value specified by the user.
             3. Variance of each cluster changes negligibly over several iterations.
             4. Average squared Euclidean distance of samples from their cluster
                centers reaches a local minimum.
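The iterative refinement and the stopping criteria above can be sketched as follows. This is a minimal, illustrative implementation in NumPy, not the implementation used in the study; the `kmeans` helper, its parameters, and the two-blob synthetic dataset are hypothetical.

```python
import numpy as np

def kmeans(X, k, init=None, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-means (Lloyd's algorithm): alternate label assignment and
    center refinement until the centers converge (criterion 1) or the
    iteration limit is reached (criterion 2)."""
    rng = np.random.default_rng(seed)
    centers = (np.asarray(init, float) if init is not None
               else X[rng.choice(len(X), size=k, replace=False)])
    for _ in range(max_iter):
        # distances of all samples from all cluster centers
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)            # nearest-center assignment
        # recompute each center as the mean of its assigned samples
        # (empty clusters are not handled in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:                          # negligible change in centers
            break
    # inertia: sum of squared Euclidean distances to the assigned centers
    inertia = ((X - centers[labels]) ** 2).sum()
    return labels, centers, inertia

# two well-separated synthetic blobs (illustrative data only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])

# initialize with one sample from each blob for a deterministic demo
labels, centers, inertia = kmeans(X, 2, init=X[[0, 50]])
```

For the well-separated blobs above, the algorithm converges in a couple of iterations; the inertia returned here is the same quantity later used to assess cluster quality.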
             At each iteration of the algorithm, new clusters are generated based only
             on the old clusters. When n samples are to be partitioned into k clusters,
             there are up to k^n possible assignments of samples to clusters. The
             quality of the final clusters identified using K-means is quantified using
             the inertia or the silhouette score. A few limitations of the K-means
             algorithm include the following:

             1. Only suitable for globular, isotropic, well-separated, and equally sized
                clusters.
             2. Number of clusters needs to be predefined by the user. Generally, the
                elbow method is used to identify the optimum number of cluster centers,
                but it can be a computationally expensive process.
             3. A single application of K-means generates nonunique clusters and cluster
                centers.
             4. Being a distance-based method, K-means needs feature scaling and
                dimensionality reduction.
             5. Not suitable for high-dimensional datasets, where each sample has a
                large number of features/attributes, because as the number of
                dimensions increases, samples become nearly equidistant from each
                other and from the cluster centers. In other words, for
                higher-dimensional datasets, the concept of distance becomes a weak
                metric for quantifying similarity between samples.
             6. Computationally expensive for large datasets because each adjustment
                of the cluster centers requires computing the distances of every
                sample in the dataset from every cluster center.
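The elbow method and the silhouette score mentioned above can be illustrated with a short sketch, assuming scikit-learn is available; the three-blob dataset and the range of candidate k values are hypothetical choices for illustration, not taken from the study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic dataset: three well-separated, globular, equally sized clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in [(0, 0), (5, 0), (0, 5)]])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_                        # decreases as k grows
    silhouettes[k] = silhouette_score(X, km.labels_)

# elbow method: look for the k beyond which inertia stops dropping sharply;
# silhouette: pick the k with the highest score
best_k = max(silhouettes, key=silhouettes.get)
```

Because inertia always decreases as k grows, it cannot be maximized directly; the elbow method instead looks for the point of diminishing returns, while the silhouette score peaks at a well-separated clustering.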


             4.2 Agglomerative clustering

             Hierarchical clustering can be broadly categorized into agglomerative
             clustering (bottom-up) and divisive clustering (top-down). In our study,
             we use agglomerative