Page 25 - Machine Learning for Subsurface Characterization



            hyperplane). An optimization routine is used to process the available data to
            select certain samples as support vectors that parameterize the decision bound-
            ary defining the hypersphere to be used for outlier detection [7].
               OCSVM is challenging to implement for high-dimensional data, is relatively
            slow to train and deploy, tends to overfit, is suitable only when the fraction of
            outliers is small, and needs careful tuning of its hyperparameters. OCSVM
            requires feature scaling and dimensionality reduction for fast training. The
            important hyperparameters of OCSVM are gamma and the outlier fraction. Gamma
            influences the radius of the Gaussian hypersphere that separates the inliers from
            the outliers; large values of gamma result in a smaller hypersphere and a
            “stricter” model that finds more outliers. It acts as the cutoff parameter for the
            Gaussian hypersphere that governs the separating boundary between inliers and
            outliers [8]. The outlier fraction defines the percentage of the dataset that is
            expected to be outliers. Specifying the outlier fraction helps in creating a tighter
            decision boundary to improve outlier detection. Similar to Fig. 1.1A, Fig. 1.1B
            illustrates the working of the one-class SVM, where the interfaces between two
            different shades are a few of the possible decision functions that can be
            used for outlier detection. Fig. 1.1B illustrates the outlier detection by the
            OCSVM when applied to a simple two-dimensional dataset containing 25 sam-
            ples having two features/attributes. Red samples (gray in the print version) are
            outliers, and the shade of blue (light gray in the print version) in the background
            is indicative of the degree of normality of samples lying in the shaded region,
            where darker blue shades (dark gray in the print version) correspond to outliers
            that are easy to partition. OCSVM is effective in detecting both point and
            collective outliers when tuned properly. The ability of OCSVM to detect contextual
            outliers depends on appropriate feature selection, which can be time-consuming.
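
The role of the two hyperparameters can be sketched with a short example. This is a minimal illustration, assuming scikit-learn's OneClassSVM, where `gamma` is the coefficient of the Gaussian (RBF) kernel and `nu` plays the role of the outlier fraction (strictly, an upper bound on the fraction of training errors). The synthetic 25-sample dataset, the helper name `fit_ocsvm`, and the chosen values are assumptions made for illustration; they do not reproduce the data of Fig. 1.1B.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for a two-feature dataset (assumption, not the book's data):
# 22 clustered samples plus 3 samples placed far from the cluster.
clustered = rng.normal(loc=0.0, scale=1.0, size=(22, 2))
distant = np.array([[6.0, 6.0], [-7.0, 5.0], [6.5, -6.0]])
X = np.vstack([clustered, distant])


def fit_ocsvm(X, gamma, nu):
    """Hypothetical helper: scale the features, then fit a one-class SVM."""
    model = make_pipeline(
        StandardScaler(),                               # OCSVM needs scaled features
        OneClassSVM(kernel="rbf", gamma=gamma, nu=nu),  # nu ~ outlier fraction
    )
    model.fit(X)
    return model


model = fit_ocsvm(X, gamma=0.1, nu=0.12)
labels = model.predict(X)            # +1 for inliers, -1 for outliers
scores = model.decision_function(X)  # signed distance to the decision boundary

print("samples flagged as outliers:", int((labels == -1).sum()))
```

Refitting with several values of `gamma` and `nu` and comparing the number of flagged samples is one simple way to explore how sensitive the decision boundary is to the tuning discussed above.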

            3.3 DBSCAN

            Density-based spatial clustering of applications with noise (DBSCAN) is a
            density-based clustering algorithm that can be used as an unsupervised ODT.
            The density of a region depends on the number of samples in that region and
            the proximity of the samples to each other. DBSCAN seeks to find regions
            of high density separated by low-density regions in a dataset. Samples in the
            high-density regions are labeled as inliers, whereas those in low-density regions
            are labeled as outliers. The key idea is that for each sample in the inlier cluster,
            the neighborhood region of a certain user-defined size (referred to as bandwidth)
            must contain at least a minimum number of samples, that is, the density in
            the neighborhood must exceed a user-defined threshold [9]. Samples that do
            not meet the density threshold are labeled as outliers. DBSCAN requires the
            tuning of the following hyperparameters that control the outlier detection
            process: the minimum number of samples required to form an inlier cluster; the
            maximum distance between two samples for them to be treated as neighbors
            within an inlier cluster; and the parameter p
            that determines the distance measure in the form of the Minkowski distance,
            such that Minkowski distance transforms into Euclidean distance for p = 2.
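
The density criterion and the three hyperparameters can be made concrete with a small sketch. The following is a simplified, pure-Python illustration of the DBSCAN idea and not an optimized or reference implementation. The function names `minkowski` and `dbscan_outliers`, the argument names `eps` (neighborhood size), `min_samples`, and `p`, and the nine sample points are all assumptions made for illustration.

```python
def minkowski(a, b, p=2.0):
    """Minkowski distance; p = 2 gives Euclidean, p = 1 gives Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def dbscan_outliers(points, eps, min_samples, p=2.0):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for an outlier."""
    n = len(points)
    # Neighborhood of each point: every point within eps (the point itself included).
    neighbors = [
        [j for j in range(n) if minkowski(points[i], points[j], p) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_samples points in its neighborhood.
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from an unassigned core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            k = stack.pop()
            for j in neighbors[k]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if is_core[j]:  # only core points keep expanding the cluster
                        stack.append(j)
        cluster += 1
    return labels


# Two dense groups of four points each and one isolated point (illustrative data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (10, 0)]
labels = dbscan_outliers(points, eps=1.5, min_samples=3, p=2.0)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point at (10, 0) has only itself in its neighborhood, so it fails the density threshold and keeps the label -1, while the two dense groups form separate inlier clusters.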