3.1 Isolation forest
Isolation forest (IF) assumes that outliers tend to lie in sparse regions of the feature space and have more empty space around them than the densely clustered normal/inlier data [5]. Since outliers are in less populated regions of the dataset, it generally takes fewer random partitions to isolate them in a segment/partition. In other words, since outliers are few and different, they are more susceptible to isolation [6]. IF is an unsupervised ODT that uses a forest of randomly partitioned trees to isolate outlier samples in terminating nodes. IF performs recursive random partitioning/splitting of the feature space by randomly subsampling features and corresponding threshold values of the features.
This generates a treelike structure, where the number of splits required to isolate a sample in a terminating node equals the path length from the root node to that terminating node. This path length, averaged over a forest of such random trees, is a measure of the normality of a sample, such that anomalies/outliers have noticeably shorter path lengths; in other words, outliers can be isolated with only a few partitions of the feature space.
A decision function categorizes each observation as an inlier or outlier based
on the path length of the observation compared with the average path length
of all observations. Unlike most other unsupervised ODTs that use distance
and density as measures for outlier detection, IF uses isolation as a measure.
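For concreteness, this path-length-based score can be written out explicitly; the following is the standard formulation from the original isolation forest paper [6], and the symbols below follow that paper rather than anything defined in this section:

$$s(x, n) = 2^{-E[h(x)]/c(n)}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n},$$

where $h(x)$ is the path length of sample $x$ in a single tree, $E[h(x)]$ is its average over the forest, $n$ is the number of samples used to build each tree, and $H(i) \approx \ln i + 0.5772$ is the harmonic number; $c(n)$ normalizes the path length by its expected value for $n$ samples. Scores close to 1 flag outliers, whereas scores well below 0.5 indicate normal samples.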
IF has low computational requirements and time complexity, is fast to deploy, and can be parallelized for faster computation. IF does not require feature scaling or dimensionality reduction. Like other tree-based methods, IF does not need much tuning because hundreds of different trees (with different subsamples of features and feature thresholds) are trained in parallel on the dataset. Nonetheless, users who intend to control the performance of outlier detection need to tune the following hyperparameters: amount of contamination in the dataset, number of trees/estimators, maximum number of samples to be used in each tree, and maximum number of subsampled features used in each tree. These hyperparameters govern the learning process. Fig. 1.1A illustrates outlier detection by the isolation forest applied to a simple two-dimensional dataset containing 25 samples having two features/attributes (represented by the x- and y-axes). Red samples (gray in the print version) are outliers, and the shade of blue (light gray in the print version) in the background is indicative of the degree of normality of samples lying in the shaded region, where darker blue shades (dark gray in the print version) correspond to outliers that are easy to partition. Isolation forest is effective in detecting point outliers and tends to fail in detecting collective outliers and contextual outliers.
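As a minimal sketch of the hyperparameters listed above, the snippet below applies scikit-learn's IsolationForest to a synthetic 25-sample, two-feature dataset loosely mirroring the setup of Fig. 1.1A. The data, contamination value, and random seed are illustrative assumptions, not values taken from the chapter.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 25 samples with two features: a dense inlier cluster plus a few scattered outliers
X_inliers = 0.3 * rng.randn(20, 2)
X_outliers = rng.uniform(low=-3.0, high=3.0, size=(5, 2))
X = np.vstack([X_inliers, X_outliers])

forest = IsolationForest(
    n_estimators=100,     # number of trees/estimators in the forest
    max_samples="auto",   # maximum number of samples used to build each tree
    contamination=0.2,    # assumed amount of contamination (fraction of outliers)
    max_features=1.0,     # maximum fraction of subsampled features per tree
    random_state=42,
)
labels = forest.fit_predict(X)        # +1 for inliers, -1 for outliers
scores = forest.decision_function(X)  # lower scores = shorter average path lengths
print("samples flagged as outliers:", np.where(labels == -1)[0])
```

Note that the raw features are passed in directly, without scaling or dimensionality reduction, consistent with the remarks above; each tree's random feature/threshold subsampling makes the method insensitive to feature magnitudes.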
3.2 One-class SVM
One-class support vector machine (OCSVM) is a parametric unsupervised ODT
suitable when the data points (i.e., samples) are mostly “normal” data with very