Page 32 - Machine Learning for Subsurface Characterization
Unsupervised outlier detection techniques Chapter 1 17
for pixel intensity in an image, which is bound within a range due to the data
acquisition requirements, and is suitable for amplitude of a speech signal, which
is bound within a range. The presence of outliers adversely affects the Standard
scaler and severely affects the MinMax scaler. Robust scaler overcomes the
limitations of MinMax scaler and Standard scaler by using the first and third
quartiles for scaling the features instead of the minimum, mean, and maximum
values. The use of quartiles ensures that the robust scaler is not sensitive to out-
liers, whereas the minimum and maximum values used in the MinMax scaler
could be the outliers and the mean and standard deviation values used in the
Standard scaler are influenced by outliers.
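The differing sensitivity of the three scalers to outliers can be illustrated with a minimal sketch in plain Python. The sample values and helper names are hypothetical. The robust scaler here centers on the median and divides by the interquartile range, which is a common convention and an assumption beyond the description above.

```python
import statistics


def minmax_scale(values):
    """Scale to [0, 1] using the minimum and maximum (outlier-sensitive)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (Q3 - Q1).

    Assumption: median centering is a common convention, not stated in the text.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def spread(values):
    """Range occupied by a list of scaled values."""
    return max(values) - min(values)


# Hypothetical feature: nine inliers followed by one extreme outlier.
feature = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 1000.0]

# Range occupied by the nine inliers after each scaling (outlier excluded).
spread_minmax = spread(minmax_scale(feature)[:-1])
spread_standard = spread(standard_scale(feature)[:-1])
spread_robust = spread(robust_scale(feature)[:-1])

print(f"MinMax:   {spread_minmax:.4f}")
print(f"Standard: {spread_standard:.4f}")
print(f"Robust:   {spread_robust:.4f}")
```

In this toy case the single outlier compresses the inliers into a very narrow band under MinMax scaling and a somewhat wider one under Standard scaling, whereas the quartile-based scaling preserves most of their spread.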
4.3 Validation dataset
We created four distinct validation datasets containing known real/synthetic
outliers to assess and compare the performances of the four unsupervised ODTs
studied in this chapter. Because these methods are unsupervised, there is no direct way of
quantifying the performances of isolation forest, local outlier factor, DBSCAN,
and one-class SVM. Therefore, we use expert knowledge, physically
consistent thresholds, and various synthetic data creation methods to assign
an outlier or inlier label to each sample in the validation dataset. Several samples
in the labeled validation dataset are synthetic samples generated using
physically consistent formulations. Each of the four validation datasets is processed
by each of the four unsupervised ODTs; following that, the inliers
and outliers detected by the unsupervised ODT are compared with the prespecified
outlier/inlier labels assigned to the samples of the validation dataset by
the human expert.
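The comparison between detected and prespecified labels can be sketched as follows. This is only an illustration: the label encoding (1 for outlier, 0 for inlier), the example label vectors, the function names, and the choice of precision, recall, and F1 score as summary metrics are all assumptions, since this passage does not name the metrics used.

```python
def confusion_counts(expert_labels, detected_labels, outlier=1):
    """Count agreement between expert labels and ODT output.

    Returns (tp, fp, fn, tn), treating `outlier` as the positive class.
    """
    tp = fp = fn = tn = 0
    for truth, pred in zip(expert_labels, detected_labels):
        if pred == outlier and truth == outlier:
            tp += 1
        elif pred == outlier:
            fp += 1
        elif truth == outlier:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Summary scores for the outlier class; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for ten samples (1 = outlier, 0 = inlier).
expert = [0, 0, 0, 0, 1, 1, 0, 0, 1, 0]    # assigned by the human expert
detected = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]  # returned by one unsupervised ODT

tp, fp, fn, tn = confusion_counts(expert, detected)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(tp, fp, fn, tn)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Running the same comparison for each ODT on each validation dataset would give one set of scores per method and dataset, which can then be compared side by side.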
4.3.1 Dataset #1: Containing noisy measurements
Dataset #1 was constructed from the previously mentioned onshore dataset to
compare the performance of the four unsupervised outlier detection techniques
in detecting depths where the log responses are adversely affected by noise.
Noise in a well-log dataset can adversely affect its geological/geophysical interpretation
as it masks the formation properties. The onshore dataset contains log
responses measured at 5617 depths in Well 1, drilled with a bit of size 7.875″.
Dataset #1 comprises gamma ray (GR), bulk density (RHOB), compressional
sonic travel time (DTC), and deep resistivity (RT) logs from the onshore dataset
for the depths where the borehole diameter is between 7.8″ and 8.2″. This led to
4037 inliers in Dataset #1. Following that, synthetic noisy log responses for 200
additional depths were randomly introduced into Dataset #1. The noise samples
were created such that they belong to the same distribution as the inlier data
but are two standard deviations away from the mean of each feature in Dataset
#1. Consequently, Dataset #1 contains 4237 samples in total, of which 200
are point outliers. Comparative study on Dataset #1 involved experiments with