Page 25 - Machine Learning for Subsurface Characterization


hyperplane). An optimization routine is used to process the available data to
select certain samples as support vectors that parameterize the decision bound-
ary defining the hypersphere to be used for outlier detection [7].
OCSVM is challenging to implement for high-dimensional data, is comparatively
slow to train and deploy, tends to overfit, is suitable when the fraction of
outliers is small, and needs careful tuning of its hyperparameters. OCSVM
requires feature scaling and dimensionality reduction for fast training. The
important hyperparameters of OCSVM are gamma and the outlier fraction. Gamma
influences the radius of the Gaussian hypersphere that separates the inliers
from the outliers; large values of gamma result in a smaller hypersphere and a
“stricter” model that finds more outliers. Gamma acts as the cutoff parameter
for the Gaussian hypersphere that governs the separating boundary between
inliers and outliers [8]. The outlier fraction specifies the fraction of the
dataset that is expected to be outliers. Setting it appropriately helps create
a tighter decision boundary and improves outlier detection. Similar
to Fig. 1.1A, Fig. 1.1B illustrates the working of the one-class SVM, where the
interfaces between two different shades are a few of the possible decision
functions that can be used for outlier detection. In Fig. 1.1B, the OCSVM is
applied to a simple two-dimensional dataset containing 25 samples having two
features/attributes. Red samples (gray in the print version) are outliers, and
the shade of blue (light gray in the print version) in the background indicates
the degree of normality of samples lying in the shaded region, where darker
blue shades (dark gray in the print version) correspond to outliers that are
easy to partition. OCSVM is effective in detecting both point and collective
outliers when tuned properly. The ability of OCSVM to detect contextual
outliers depends on appropriate feature selection, which can be time-consuming.
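
The role of gamma can be illustrated with a small sketch. It assumes that the "Gaussian hypersphere" refers to the Gaussian (RBF) kernel k(x, y) = exp(-gamma * ||x - y||^2), which is the usual kernel choice for OCSVM but is not stated explicitly in the text. The function name `rbf_similarity` and the sample coordinates are hypothetical and serve only as an illustration.

```python
import math


def rbf_similarity(a, b, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


# Hypothetical pair of samples; the squared distance between them is 0.5.
reference = (0.0, 0.0)
nearby = (0.5, 0.5)

# Larger gamma makes the similarity fall off faster with distance.
for gamma in (0.1, 1.0, 10.0):
    print(f"gamma={gamma:>4}: similarity={rbf_similarity(reference, nearby, gamma):.4f}")
```

Under this assumption, a large gamma means that each training sample influences only its immediate surroundings, so the learned boundary wraps tightly around the training samples and more samples fall outside it. A small gamma gives each sample a wide reach and produces a looser boundary.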

3.3 DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a
density-based clustering algorithm that can be used as an unsupervised ODT.
The density of a region depends on the number of samples in that region and
the proximity of the samples to each other. DBSCAN seeks to find regions
of high density separated by low-density regions in a dataset. Samples in the
high-density regions are labeled as inliers, whereas those in low-density regions
are labeled as outliers. The key idea is that for each sample in the inlier cluster,
the neighborhood region of a certain user-defined size (referred to as the
bandwidth) must contain at least a minimum number of samples; that is, the
density in the neighborhood must exceed a user-defined threshold [9]. Samples
that do not meet the density threshold are labeled as outliers. DBSCAN requires
the tuning of the following hyperparameters that control the outlier detection
process: the minimum number of samples required to form an inlier cluster; the
maximum distance between two samples for them to be considered neighbors (the
bandwidth); and the parameter p that determines the distance measure in the
form of the Minkowski distance, such that the Minkowski distance reduces to the
Euclidean distance for p = 2.
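
The procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation. The names `eps`, `min_samples`, and `p` are illustrative labels for the bandwidth, the density threshold, and the Minkowski order. The sketch also assumes that a sample counts as a member of its own neighborhood, and the toy dataset is hypothetical.

```python
from collections import deque


def minkowski(a, b, p=2.0):
    """Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def dbscan_outliers(samples, eps, min_samples, p=2.0):
    """Return one label per sample: a cluster id (0, 1, ...) or -1 for an outlier."""
    n = len(samples)
    # Neighborhood of each sample: indices within eps (the sample itself included).
    neighbors = [
        [j for j in range(n) if minkowski(samples[i], samples[j], p) <= eps]
        for i in range(n)
    ]
    # A core sample has a neighborhood that meets the density threshold.
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = [-1] * n  # every sample starts out labeled as an outlier
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a new cluster outward from this unassigned core sample.
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            current = queue.popleft()
            for j in neighbors[current]:
                if labels[j] == -1:
                    labels[j] = cluster_id
                    if is_core[j]:  # only core samples keep expanding the cluster
                        queue.append(j)
        cluster_id += 1
    return labels


# Hypothetical toy data: five tightly packed samples and two isolated ones.
dense = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
isolated = [(2.0, 2.0), (-1.5, 3.0)]
labels = dbscan_outliers(dense + isolated, eps=0.2, min_samples=3)
print(labels)  # -> [0, 0, 0, 0, 0, -1, -1]
```

The pairwise distance computation is quadratic in the number of samples, which is acceptable for a small illustration. Practical implementations typically rely on a spatial index to find neighbors.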
