Page 27 - Machine Learning for Subsurface Characterization


            approach and relative density calculation, LOF is able to identify outliers in a
            dataset that would not be outliers in another area of the dataset. The major
            hyperparameters for tuning are the number of neighbors K to consider for
            each sample and the metric p for measuring the distance, similar to DBSCAN,
            where the general form of the Minkowski distance reduces to the Euclidean
            distance for p = 2.
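As a minimal sketch of these two hyperparameters, the following uses scikit-learn's LocalOutlierFactor on a small hypothetical 2-D dataset (a stand-in, not the synthetic 25-sample dataset shown in the figures); `n_neighbors` corresponds to K, and `p = 2` makes the Minkowski metric Euclidean:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical 2-D dataset: one dense cluster plus two isolated samples.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, size=(23, 2)),   # dense inlier region
               [[4.0, 4.0], [-4.0, 3.5]]])          # isolated samples

# n_neighbors is K; p=2 reduces the Minkowski distance to Euclidean distance.
lof = LocalOutlierFactor(n_neighbors=5, p=2)
labels = lof.fit_predict(X)   # +1 for inliers, -1 for outliers

print(labels[-2:])
```

Because the two isolated samples sit far from the dense region, their local density is much lower than that of their neighbors, so LOF assigns them high outlier scores and labels them -1.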


            3.5 Influence of hyperparameters on the unsupervised ODTs
            Hyperparameters are the parameters of the model that are defined by the user
            prior to training/applying the model on the data. For example, the number of
            layers and the number of neurons per layer are hyperparameters of a neural
            network, and the number of trees and the maximum depth of a tree are hyper-
            parameters of a random forest. Hyperparameters control the learning process; for
            unsupervised ODTs, hyperparameters determine the decision boundaries (e.g.,
            OCSVM), partitions (e.g., IF), similarity/dissimilarity labels (e.g., DBSCAN),
            and scores (e.g., LOF) that differentiate the inliers from outliers. By changing
            the hyperparameters, we can effectively alter the performance of an unsuper-
            vised ODT. The effects of hyperparameters on the unsupervised ODT are evi-
            dent when comparing Fig. 1.1 with Fig. 1.2.
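To make the definition concrete, the sketch below (using scikit-learn; the specific estimators and values are illustrative, not from the text) shows how hyperparameters are fixed by the user at construction time, before the model ever sees data:

```python
from sklearn.ensemble import IsolationForest, RandomForestClassifier

# Random forest: the number of trees and the maximum tree depth are
# hyperparameters, chosen before fitting.
rf = RandomForestClassifier(n_estimators=100, max_depth=5)

# Isolation forest (an unsupervised ODT): "contamination" is a hyperparameter
# controlling the threshold that separates inliers from outliers.
iso = IsolationForest(n_estimators=100, contamination=0.1, random_state=0)

print(rf.get_params()["max_depth"], iso.get_params()["contamination"])
```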
               When the IF hyperparameter “contamination” is changed from “auto” (i.e.,
            the model labels outliers based on the default threshold of the model) to 0.1,
            the isolation forest model is constrained to label only 10% of the dataset (~3
            samples) as outliers (Fig. 1.2A), thereby worsening the outlier detection for
            the synthetic 25-sample dataset. The IF model therefore labels the “top 3” outliers
            based on the learnt decision functions, as shown in Fig. 1.2A. When the
            OCSVM hyperparameter “nu” (maximum fraction of possible outliers in
            the dataset) is changed from 0.5 (Fig. 1.1B) to 0.2 (Fig. 1.2B), the OCSVM
            model becomes more conservative in outlier detection and detects fewer out-
            liers, thereby improving the outlier detection for the synthetic 25-sample
            dataset (Fig. 1.1B vs. Fig. 1.2B). When the LOF hyperparameter “number of
            neighbors” is reduced, the effect of the high-density regions on the scores
            assigned to samples in low-density regions is also reduced, thereby improving
            outlier detection for the synthetic 25-sample dataset (Fig. 1.1C vs. Fig. 1.2C). Finally,
            when the value of epsilon (bandwidth) is increased from 0.5 (in Fig. 1.1D) to 1
            (in Fig. 1.2D), all the data points are considered as inliers (Fig. 1.2D) because all
            samples can now meet the density requirement, such that several samples
            form an independent cluster; thereby worsening the outlier detection for the
            synthetic 25-sample dataset. Increasing the bandwidth means that a sample
            belongs to an inlier cluster even when it lies in a low-density region (i.e., the
            bandwidth defines the maximum distance around a sample within which a
            minimum number of neighboring samples must lie for the sample to be
            considered part of an inlier cluster).
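The four hyperparameter effects described above can be reproduced in a short scikit-learn sketch. The dataset here is a hypothetical stand-in (a dense cluster plus three isolated samples), not the actual 25-sample synthetic dataset of Figs. 1.1 and 1.2, so the exact counts differ, but the qualitative trends match:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN

# Hypothetical 25-sample stand-in: a dense cluster plus 3 isolated samples.
rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 0.4, size=(22, 2)),
               [[3.5, 3.5], [-3.0, 3.0], [3.0, -3.5]]])

def n_outliers(labels):
    # All four detectors mark outliers/noise with the label -1.
    return int(np.sum(labels == -1))

# IF: contamination=0.1 caps the outliers at ~10% of the 25 samples.
if_01 = IsolationForest(contamination=0.1, random_state=0).fit_predict(X)

# OCSVM: lowering nu makes the model more conservative (fewer outliers).
oc_05 = OneClassSVM(nu=0.5, gamma="scale").fit_predict(X)
oc_02 = OneClassSVM(nu=0.2, gamma="scale").fit_predict(X)

# LOF: a small n_neighbors limits the influence of distant dense regions
# on the scores of samples in low-density regions.
lof = LocalOutlierFactor(n_neighbors=5).fit_predict(X)

# DBSCAN: a large eps (bandwidth) lets every sample meet the density
# requirement, so no sample is labelled as noise/outlier.
db_small = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
db_large = DBSCAN(eps=8.0, min_samples=3).fit_predict(X)

for name, lab in [("IF c=0.1", if_01), ("OCSVM nu=0.5", oc_05),
                  ("OCSVM nu=0.2", oc_02), ("LOF k=5", lof),
                  ("DBSCAN eps=0.5", db_small), ("DBSCAN eps=8.0", db_large)]:
    print(name, "->", n_outliers(lab), "outliers")
```

With this toy data, the IF model flags at most the "top" ~10% of samples, the nu=0.2 OCSVM flags no more outliers than the nu=0.5 one, LOF isolates the three sparse samples, and widening the DBSCAN bandwidth from 0.5 to 8.0 absorbs every sample into an inlier cluster.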