Machine Learning for Subsurface Characterization
approach and relative density calculation, LOF can identify samples that are outliers with respect to their local neighborhood, even when those samples would not be considered outliers in another region of the dataset. The major hyperparameters for tuning are the number of neighbors K considered for each sample and the metric order p for measuring distance; similar to DBSCAN, the general form of the Minkowski distance reduces to the Euclidean distance for p = 2.
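As an illustrative sketch (assuming scikit-learn and a hypothetical two-dimensional dataset), both hyperparameters appear directly in the LOF constructor, where p = 2 selects the Euclidean special case of the Minkowski metric:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical dataset: a dense inlier cloud plus two obvious outliers
rng = np.random.default_rng(42)
X = rng.normal(0.0, 1.0, size=(100, 2))
X = np.vstack([X, [[8.0, 8.0], [-7.0, 9.0]]])

# n_neighbors (K) and p (Minkowski order) are the key hyperparameters;
# p=2 makes the Minkowski metric equivalent to the Euclidean distance.
lof = LocalOutlierFactor(n_neighbors=20, p=2)
labels = lof.fit_predict(X)  # -1 = outlier, +1 = inlier
print(labels[-2:])           # labels of the two appended points
```

With these settings, the two appended points are far from the cloud relative to their local neighborhoods and receive the outlier label.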
3.5 Influence of hyperparameters on the unsupervised ODTs
Hyperparameters are the parameters of a model that are defined by the user prior to training/applying the model on the data. For example, the number of layers and the number of neurons per layer are hyperparameters of a neural network, and the number of trees and the maximum depth of a tree are hyperparameters of a random forest. Hyperparameters control the learning process; for unsupervised ODTs, hyperparameters determine the decision boundaries (e.g., OCSVM), partitions (e.g., IF), similarity/dissimilarity labels (e.g., DBSCAN), and scores (e.g., LOF) that differentiate the inliers from the outliers. By changing the hyperparameters, we can effectively alter the performance of an unsupervised ODT. The effects of hyperparameters on the unsupervised ODTs are evident when comparing Fig. 1.1 with Fig. 1.2.
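The distinction between user-defined hyperparameters and learned parameters can be sketched with the random forest example above (assuming scikit-learn; the dataset is synthetic and purely illustrative): the number of trees and the maximum depth are fixed in the constructor before training, and fitting never changes them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=100, random_state=0)

# Hyperparameters (set by the user BEFORE training):
#   n_estimators -> number of trees, max_depth -> maximum depth of each tree
rf = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
rf.fit(X, y)  # learned parameters (the trees' splits) appear only after fit()

print(len(rf.estimators_))                              # 25 trees, as requested
print(max(t.get_depth() for t in rf.estimators_) <= 3)  # depth capped at 3
```

The fitted attributes (here, `estimators_`) are the learned parameters; the constructor arguments remain exactly as the user set them.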
When the IF hyperparameter “contamination” is changed from “auto” (i.e., the model labels outliers based on its default threshold) to 0.1, the isolation forest model is constrained to label only 10% of the dataset (about 3 samples) as outliers (Fig. 1.2A), thereby worsening the outlier detection for the synthetic 25-sample dataset. The IF model therefore labels only the “top 3” outliers based on the learned decision functions, as shown in Fig. 1.2A. When the
OCSVM hyperparameter “nu” (the maximum fraction of possible outliers in the dataset) is changed from 0.5 (Fig. 1.1B) to 0.2 (Fig. 1.2B), the OCSVM model becomes more conservative and detects fewer outliers, thereby improving the outlier detection for the synthetic 25-sample dataset (Fig. 1.1B vs. Fig. 1.2B). When the LOF hyperparameter “number of neighbors” is reduced, the effect of the high-density regions on the scores assigned to samples in low-density regions is also reduced, thereby improving outlier detection for the synthetic 25-sample dataset (Fig. 1.1C vs. Fig. 1.2C). Finally,
when the value of epsilon (bandwidth) is increased from 0.5 (in Fig. 1.1D) to 1 (in Fig. 1.2D), all the data points are considered inliers because every sample can now meet the density requirement, such that samples that would previously have been labeled as noise now gather enough neighbors to form or join clusters; this worsens the outlier detection for the synthetic 25-sample dataset. Increasing the bandwidth means that a sample can belong to an inlier cluster even when it lies in a low-density region (i.e., the bandwidth defines the maximum distance around a sample within which neighboring samples are counted; a sample with at least the minimum required number of neighbors within this distance is treated as part of an inlier cluster).
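The four hyperparameter effects discussed above can be sketched on a hypothetical 25-sample dataset (assuming scikit-learn; the data, seeds, and planted outliers are illustrative assumptions, not the figures’ actual dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Hypothetical 25-sample dataset: dense inlier cloud + 3 planted outliers
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(22, 2)),
               [[4.0, 4.0], [5.0, -4.0], [-4.0, 5.0]]])

# IF: contamination=0.1 forces ~10% of the 25 samples (about 3) to be
# labeled outliers, regardless of how well that matches the data
n_if = (IsolationForest(contamination=0.1, random_state=0)
        .fit_predict(X) == -1).sum()

# OCSVM: a smaller nu upper-bounds the outlier fraction more tightly,
# so the detector becomes more conservative and flags fewer samples
n_nu_05 = (OneClassSVM(nu=0.5).fit_predict(X) == -1).sum()
n_nu_02 = (OneClassSVM(nu=0.2).fit_predict(X) == -1).sum()

# LOF: with fewer neighbors, scores depend mostly on the local neighborhood,
# so the three planted outliers stand out clearly
lof_labels = LocalOutlierFactor(n_neighbors=5).fit_predict(X)

# DBSCAN: eps (bandwidth) is the neighborhood radius; a very large eps lets
# every sample satisfy the density requirement, so nothing is noise (-1)
n_noise_small = (DBSCAN(eps=0.5, min_samples=4).fit(X).labels_ == -1).sum()
n_noise_large = (DBSCAN(eps=15.0, min_samples=4).fit(X).labels_ == -1).sum()

print(n_if, n_nu_05, n_nu_02, lof_labels[-3:], n_noise_small, n_noise_large)
```

On this sketch, the IF flags roughly 10% of the samples, the smaller nu yields no more outliers than the larger one, LOF flags the three planted outliers, and the oversized eps eliminates all DBSCAN noise labels, mirroring the qualitative behavior described above.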