Page 61 - Machine Learning for Subsurface Characterization
Characterization of fracture-induced geomechanical alterations Chapter 2 47
3. Computation time increases significantly with data size and
dimensionality.
4. Unlike K-means, hierarchical clustering does not minimize a global
objective function, so it offers no guarantee of a best/optimum solution;
consequently, hierarchical clustering can be difficult to implement and interpret.
5. Unlike K-means and single/complete linkage, Ward’s linkage distorts the
feature space and is not space conserving.
6. Hierarchical clustering does not allow backtracking or object swapping
between clusters. Once a certain label is assigned to a sample or to a cluster
containing that sample, the assignment is never revisited; all subsequent
labels for that sample depend on the prior labels and the hierarchy [11].
7. It emphasizes a collection of samples over individual samples when
generating new clusters.
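The lack of backtracking noted in item 6 can be illustrated with a minimal sketch. It assumes naive single-linkage agglomeration on made-up one-dimensional data, chosen only for brevity; it is not the Ward's linkage discussed above, and the function names agglomerate and is_nested are hypothetical. Each merge is permanent, so every partition in the history is nested inside the next, coarser one.

```python
from math import dist

def agglomerate(points):
    """Naive single-linkage agglomeration (illustrative only).

    Returns the list of partitions, from n singleton clusters down to one.
    Each partition is a list of frozensets of sample indices.
    """
    clusters = [frozenset([i]) for i in range(len(points))]
    history = [list(clusters)]
    while len(clusters) > 1:
        # Find the closest pair of clusters; single linkage uses the
        # minimum pairwise distance between their members.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] | clusters[b]
        # The merge is permanent: no sample ever leaves the merged cluster.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        history.append(list(clusters))
    return history

def is_nested(fine, coarse):
    """True if every cluster of `fine` lies inside some cluster of `coarse`."""
    return all(any(f <= c for c in coarse) for f in fine)

# Made-up 1-D samples, written as 1-tuples so math.dist accepts them.
points = [(0.0,), (0.4,), (1.0,), (5.0,), (5.3,), (9.0,)]
history = agglomerate(points)
nested = all(is_nested(history[k], history[k + 1])
             for k in range(len(history) - 1))
print(nested)
```

The nested loops over cluster pairs and their members also hint at item 3: in this naive form the work per merge grows quickly with the number of samples.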
4.3 DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) is based
on the assumption that clusters are dense regions in the feature space separated
by lower-density regions [12]. DBSCAN uses proximity and density of samples
to form clusters. Each cluster identified by DBSCAN is built around samples that have at least
a minimum number of neighboring samples (nmin) within a certain distance
set by the user-specified bandwidth (commonly denoted eps). When implementing DBSCAN,
the user needs to specify values for nmin and bandwidth that are suited to a
given dataset. The DBSCAN algorithm starts by computing pairwise distances
between all samples. Each sample in the dataset is then labeled as a
core, border, or noise point according to the number of neighboring
samples within the bandwidth distance around it. Any sample
with at least nmin neighbors within the bandwidth is
marked as a core point, and any sample that lies within the neighborhood of a core
point but has fewer than nmin neighbors within the bandwidth
is marked as a border point. All points that are neither core nor border
points are marked as noise. Next, DBSCAN randomly selects a
core point (not yet assigned to any cluster) and recursively finds all
density-connected points, which are assigned to the same cluster as the
selected core point. These steps are repeated until every sample is assigned a
cluster label or marked as noise. The user needs to carefully select the
values of bandwidth and nmin. A small bandwidth combined with a large nmin will
result in several sparsely distributed, diffuse clusters, with several samples
marked as noise points. A large bandwidth will generate a few large clusters.
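For noisy datasets, larger values of nmin are recommended.
The k-distance graph method, which plots each sample's distance to its k-th
nearest neighbor in sorted order and looks for the point of sharpest increase,
can be used to select a suitable bandwidth.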
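The steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names dbscan, region_query, and k_distances are hypothetical, the data are made up, a sample counts itself among its own neighbors (one common convention), and unassigned core points are visited in index order rather than at random. The visiting order does not change which core points end up together, although a border point lying near two clusters could be assigned to either.

```python
from math import dist

NOISE = -1  # label used for samples that belong to no cluster

def region_query(points, i, eps):
    """Indices of all samples within eps of sample i (sample i included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, n_min):
    """Return (labels, kinds): a cluster id per sample (-1 for noise) and
    one of 'core', 'border', 'noise' per sample."""
    n = len(points)
    neighbors = [region_query(points, i, eps) for i in range(n)]
    is_core = [len(nb) >= n_min for nb in neighbors]
    labels = [NOISE] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue  # only an unassigned core point starts a new cluster
        labels[i] = cluster
        frontier = [i]  # core points whose neighborhoods are still to be scanned
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster
                    if is_core[q]:
                        frontier.append(q)  # expand only through core points
        cluster += 1
    kinds = []
    for i in range(n):
        if is_core[i]:
            kinds.append("core")
        elif labels[i] != NOISE:
            kinds.append("border")
        else:
            kinds.append("noise")
    return labels, kinds

def k_distances(points, k):
    """Sorted distances from each sample to its k-th nearest other sample."""
    out = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)

# Made-up data: two dense squares, one sample on the edge of the first
# square, and one isolated sample.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1),
          (5, 5), (5, 6), (6, 5), (6, 6), (10, 0)]
labels, kinds = dbscan(points, eps=1.5, n_min=3)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
print(kinds)
kd = k_distances(points, k=2)  # k = n_min - 1 because a sample counts itself
print(kd)
```

In the sorted k-distance list, most values are small and a few at the end are much larger. A bandwidth chosen near the start of that sharp rise keeps the dense samples together while leaving isolated samples as noise.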