Page 60 - Machine Learning for Subsurface Characterization

clustering, which starts by treating each sample as its own cluster and then groups similar objects (clusters or samples) into higher-level clusters, such that the newly formed higher-level clusters are distinct and the objects within each cluster are similar. Hierarchical clustering builds a hierarchy of clusters (represented as a dendrogram) from low-level clusters to high-level clusters based on a user-specified distance metric and linkage criterion. The linkage criterion defines the similarity between two clusters and decides which two clusters to combine when generating a new higher-hierarchy cluster. A few popular linkage criteria are single linkage, complete linkage, average linkage, centroid linkage, and Ward's linkage. Single linkage combines the two clusters whose closest samples are separated by the minimum distance. Single linkage is suitable for nonglobular clusters, tends to generate elongated clusters, and is affected by noise. Complete linkage combines the two clusters whose farthest samples are separated by the minimum distance. Complete linkage is not suitable for nonglobular clusters, tends to generate globular clusters, and is resistant to noise. Ward's linkage combines the two clusters whose merger results in the smallest increase in within-cluster variance. Ward's linkage generates dense clusters concentrated toward the middle, with relatively few, scattered marginal samples/points. Both the distance metric and the linkage criterion need to be defined in accordance with the phenomena/processes that generated the dataset. For example, when clustering accident sites or regions of high vehicular traffic in a dense urban city, we should use Manhattan distance instead of Euclidean distance as the distance metric. As another example, when a dataset is generated by a slowly changing process (e.g., the evolution of a civilization), single linkage is well suited to group similar clusters (e.g., of archeological objects).
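The linkage criteria described above can be sketched with SciPy's hierarchical-clustering routines. This is a minimal illustration on a small synthetic two-group dataset (the data and cluster counts are assumptions for demonstration, not from the text):

```python
# Sketch: how different linkage criteria drive agglomerative merging (SciPy).
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two loose groups of 2-D samples (synthetic data for illustration)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
               rng.normal(4.0, 0.5, size=(10, 2))])

# Condensed pairwise-distance vector; 'cityblock' (Manhattan) could replace
# 'euclidean' when the data call for it, as in the urban-traffic example
dists = pdist(X, metric="euclidean")

# Each row of the returned matrix records one merge:
# [cluster_i, cluster_j, merge_distance, new_cluster_size]
Z_single = linkage(dists, method="single")      # distance between closest samples
Z_complete = linkage(dists, method="complete")  # distance between farthest samples
Z_ward = linkage(X, method="ward")              # smallest increase in variance

print(Z_ward.shape)  # (n - 1, 4): one merge per step for n samples
```

Whatever the criterion, the algorithm performs exactly n - 1 merges for n samples, so each linkage matrix has n - 1 rows; only the order and heights of the merges differ.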
   A few advantages of agglomerative clustering are as follows:
            1. It is suitable when the underlying data have structure, order, and
               interdependencies (like the correlations in financial markets).
2. It generates a hierarchy that facilitates selection of the number of clusters in the dataset.
3. It allows generation of clusters at a user-defined granularity by searching through the dendrogram.
            4. Unlike K-means, agglomerative (hierarchical) clustering does not require
               the user to specify the number of clusters prior to applying the algorithm.
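Advantages 2-4 above follow from the fact that the hierarchy is built once and can then be cut at any height. A minimal sketch of this, assuming synthetic data and SciPy's `fcluster`:

```python
# Sketch: cutting one dendrogram at different user-defined granularities.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three well-separated synthetic groups (assumed data, for illustration)
X = np.vstack([rng.normal(c, 0.3, size=(8, 2)) for c in (0.0, 3.0, 6.0)])

Z = linkage(X, method="ward")  # build the full hierarchy once

# The same hierarchy yields clusterings at any granularity, after the fact,
# without rerunning the algorithm or fixing the cluster count in advance
labels_coarse = fcluster(Z, t=2, criterion="maxclust")  # at most 2 clusters
labels_fine = fcluster(Z, t=3, criterion="maxclust")    # at most 3 clusters

print(len(set(labels_coarse)), len(set(labels_fine)))
```

Contrast with K-means, where changing the number of clusters requires rerunning the whole algorithm with a new value of K.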

A few disadvantages of agglomerative clustering are as follows:
1. It is necessary to specify both the distance metric and the linkage criterion, which are often selected without any strong theoretical basis.
2. Compared with K-means, which is linear in computational time, hierarchical clustering techniques are at least quadratic, that is, computationally expensive and slow.