Page 61 - Machine Learning for Subsurface Characterization

Characterization of fracture-induced geomechanical alterations Chapter  2 47


             3. Computation time significantly increases with the increase in data size and
                dimensionality.
             4. Compared with K-means, hierarchical clustering does not provide a
                best/optimum solution because there is no objective function;
                consequently, hierarchical clustering is difficult to implement and interpret.
             5. Unlike K-means and single/complete linkage, Ward’s linkage distorts the
                feature space and is not space conserving.
             6. It does not allow backtracking and object swapping between clusters. Once
                a certain label is assigned to a sample or to a cluster containing a certain
                sample, the subsequent labels are assigned to that sample depending on
                the prior labels and the hierarchy [11].
             7. It emphasizes a collection of samples over individual samples when
                generating new clusters.
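
One way to see the scaling in item 3 is to count the pairwise distances that a typical agglomerative implementation starts from. The short sketch below assumes such an implementation (the text does not specify one), and the helper name n_pairwise_distances is hypothetical. It only counts sample pairs; each distance additionally costs time proportional to the number of features.

```python
def n_pairwise_distances(n_samples):
    """Number of distinct sample pairs, n(n-1)/2."""
    return n_samples * (n_samples - 1) // 2


# The pair count grows quadratically with the number of samples.
for n in (1_000, 10_000, 100_000):
    print(n, n_pairwise_distances(n))
```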


             4.3 DBSCAN
             Density-based spatial clustering of applications with noise (DBSCAN) is based
             on the assumption that clusters are dense regions in the feature space separated
             by lower-density regions [12]. DBSCAN uses proximity and density of samples
             to form clusters. Each core sample in a cluster identified by DBSCAN has at least
             a minimum number of neighboring samples (nmin) within a certain distance
             (set by the user-specified bandwidth). When implementing DBSCAN,
             the user needs to specify values of nmin and bandwidth that are suited to a
             given dataset. The DBSCAN algorithm starts by computing pairwise distances
             between all samples. Following that, each sample in the dataset is labeled as
             a core, border, or noise point based on the number of neighboring samples
             within a certain distance (set by the user-specified bandwidth) around the
             sample. Any sample with at least nmin neighbors within that distance is
             marked as a core point, and any sample that lies within the neighborhood of
             a core point but has fewer than nmin neighbors within that distance is
             marked as a border point. All points that are neither core nor border
             points are marked as noise. Following that, DBSCAN randomly selects a
             core point (not assigned to any cluster) and recursively finds all density-
             connected points, which are assigned to the same cluster as the randomly
             selected core point. These steps are repeated until every sample is assigned a
             cluster label or marked as noise (outlier). The user needs to select the
             values of bandwidth and nmin carefully. A small bandwidth combined with a
             large nmin will result in several sparsely distributed, diffuse clusters, with
             many samples marked as noise points. A large bandwidth will generate a few
             large clusters. For noisy datasets, larger values of nmin are recommended.
             The K-distance graph method can be used to select a suitable bandwidth.
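
The steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the chapter: the names dbscan, region_query, and k_distances are hypothetical, a sample is counted as its own neighbor, core points are visited in index order rather than at random, and distances are Euclidean. The k_distances helper returns the sorted distance from each sample to its k-th nearest neighbor, which is the curve plotted in a K-distance graph; using k = nmin - 1 is a common heuristic assumed here.

```python
import math
from collections import deque

NOISE = -1  # label used for noise points (an assumed convention)


def region_query(points, i, bandwidth):
    """Indices of all samples within `bandwidth` of sample i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= bandwidth]


def dbscan(points, bandwidth, nmin):
    """Return (labels, kinds): a cluster id or NOISE, and 'core'/'border'/'noise'."""
    n = len(points)
    neighbors = [region_query(points, i, bandwidth) for i in range(n)]
    is_core = [len(nb) >= nmin for nb in neighbors]
    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        # Start a new cluster only from a core point that is still unassigned.
        if not is_core[i] or labels[i] is not None:
            continue
        labels[i] = cluster_id
        queue = deque([i])  # the queue only ever holds core points
        while queue:
            p = queue.popleft()
            for j in neighbors[p]:
                if labels[j] is None:
                    labels[j] = cluster_id
                    if is_core[j]:  # border points join but do not expand
                        queue.append(j)
        cluster_id += 1
    labels = [NOISE if lab is None else lab for lab in labels]
    kinds = ["core" if is_core[i] else ("noise" if labels[i] == NOISE else "border")
             for i in range(n)]
    return labels, kinds


def k_distances(points, k):
    """Sorted distance from each sample to its k-th nearest neighbor (self excluded)."""
    out = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return sorted(out)


# Toy data: two dense squares, one border point, and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
labels, kinds = dbscan(points, bandwidth=1.5, nmin=4)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
print(kinds)
print(k_distances(points, k=3))
```

In this toy dataset, the sample at (2.2, 1) is a border point and the sample at (5, 5) is noise. In the sorted K-distance values, a sharp jump separates samples in dense regions from isolated ones, and a bandwidth near the start of that jump is a reasonable first guess.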