samples, one from each cluster. The merging process forms a hierarchical tree of clusters. In Fig. 5.5B, most of the cluster numbers are 0, which indicates that the hierarchical clustering finds most of the samples to be similar and groups most of the formation depths into one cluster. The results shown in Figs. 5.5B and 5.8 demonstrate that the hierarchical clustering algorithm performs poorly in differentiating the formation depths.
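To make the procedure concrete, a minimal sketch using scikit-learn's AgglomerativeClustering is given below; the synthetic feature matrix X and the choice of two clusters are assumptions for illustration, not the configuration used in this study.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering

# X: hypothetical feature matrix, one row per formation depth
# (e.g., log responses); replace with the actual dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))

# Scale the features so no single log dominates the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Agglomerative clustering merges the closest clusters step by step,
# building a hierarchical tree; here the tree is cut at 2 clusters.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X_scaled)

# A single dominant cluster in the counts mirrors the behavior in Fig. 5.5B.
print(np.bincount(labels))
```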
2.5.4 DBSCAN clustering
DBSCAN is a density-based clustering method. Unlike K-means clustering, the DBSCAN method does not need the user to manually define the number of clusters. Instead, it requires the user to define the minimum number of neighbors to be considered in a cluster and the maximum allowed distance between any two points for them to be part of the same cluster. Within a certain user-defined distance around a sample, DBSCAN counts the number of neighbors. When the number of neighbors within the specified distance (i.e., the data density) exceeds the threshold, DBSCAN identifies that group of data points as belonging to one cluster. Based on our extensive study, we set the minimum number of neighbors to 100 and the maximum
distance to 10. Fig. 5.5C shows that the DBSCAN clustering method identifies many data points as outliers, which are assigned to cluster number -1, while most of the formation depths are clustered into cluster number 0 (Fig. 5.8).
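A minimal sketch of this configuration using scikit-learn's DBSCAN is given below; the synthetic feature matrix X is a stand-in for the actual well-log features, while eps and min_samples follow the parameters reported above. Note that scikit-learn labels outliers as -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# X: hypothetical feature matrix, one row per formation depth;
# replace with the actual well-log features used in the study.
rng = np.random.default_rng(0)
X = rng.normal(scale=5.0, size=(1000, 8))

# eps: maximum distance between two points in the same neighborhood;
# min_samples: minimum number of neighbors required for a core point.
labels = DBSCAN(eps=10, min_samples=100).fit_predict(X)

# scikit-learn marks outliers with the label -1.
print("outliers:", np.sum(labels == -1))
print("cluster 0 size:", np.sum(labels == 0))
```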
2.5.5 SOM followed by K-means clustering
Self-organizing map (SOM) is a neural network-based dimensionality reduction algorithm generally used to represent a high-dimensional dataset as a two-dimensional discretized pattern. Reduction in dimensionality is performed while retaining the topology of the data present in the original feature space. In this study, we perform SOM dimensionality reduction followed by K-means clustering. The clustering method is essentially a K-means clustering performed on the mapping generated by SOM. As the first step, an artificial neural network is trained to generate a low-dimensional discretized representation of the data in the original feature space while preserving its topological properties; this is achieved through competitive learning. In SOM, vectors that are close in the high-dimensional space end up being mapped to SOM nodes that are close in the low-dimensional space.
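As a concrete illustration of this topology-preserving mapping, the short sketch below uses the third-party minisom package; the 10 x 10 grid, the training settings, and the synthetic feature matrix X are assumptions for illustration only.

```python
import numpy as np
from minisom import MiniSom  # third-party package: pip install minisom

# X: hypothetical feature matrix, one row per formation depth.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))

# Train a 10 x 10 SOM; each node holds a prototype (codebook) vector
# living in the original 8-dimensional feature space.
som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, 5000)

# winner() returns the grid coordinates of the best-matching node;
# samples that are close in feature space map to nearby nodes.
print(som.winner(X[0]))
```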
K-means can be considered a simplified case of SOM, wherein the nodes (centroids) are independent of each other. K-means is highly sensitive to the initial positions of the centroids, and it is not suitable for high-dimensional datasets. The two-stage procedure for clustering adopted in this study first uses SOM to produce the low-dimensional prototypes (abstractions), which are then clustered in the second stage using K-means. This two-step clustering method reduces the computational time and improves the efficiency of K-means clustering.
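A minimal sketch of this two-stage workflow is given below, assuming the third-party minisom package together with scikit-learn; the 10 x 10 grid, the choice of two clusters, and the synthetic feature matrix X are illustrative assumptions, not the configuration used in this study.

```python
import numpy as np
from minisom import MiniSom  # third-party package: pip install minisom
from sklearn.cluster import KMeans

# X: hypothetical feature matrix, one row per formation depth.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))

# Stage 1: SOM compresses the samples into a 10 x 10 grid of
# prototype (codebook) vectors in the original feature space.
som = MiniSom(10, 10, X.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(X)
som.train_random(X, 5000)
codebook = som.get_weights().reshape(-1, X.shape[1])  # (100, 8)

# Stage 2: K-means clusters the 100 prototypes rather than the raw
# samples, which is cheaper and less sensitive to initialization.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(codebook)

# Each sample inherits the cluster of its best-matching SOM node.
winners = np.array([som.winner(x) for x in X])
labels = kmeans.labels_[winners[:, 0] * 10 + winners[:, 1]]
print(np.bincount(labels))
```

Because only the prototypes are clustered, the K-means stage operates on a fixed, small number of points regardless of the size of the original dataset. Even with a relatively small number of samples, many clustering algorithms, especially hierarchical ones, become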