Page 32 - Machine Learning for Subsurface Characterization

Unsupervised outlier detection techniques Chapter  1 17


             for pixel intensity in an image, which is bounded within a range by the data
             acquisition requirements, and for the amplitude of a speech signal, which
             is also bounded within a range. The presence of outliers adversely affects the
             Standard scaler and severely affects the MinMax scaler. The Robust scaler
             overcomes these limitations by using the first and third quartiles to scale
             the features instead of the minimum and maximum values (MinMax scaler) or the
             mean and standard deviation (Standard scaler). Because quartiles are
             insensitive to extreme values, the Robust scaler is not sensitive to outliers,
             whereas the minimum and maximum values used in the MinMax scaler could
             themselves be outliers, and the mean and standard deviation used in the
             Standard scaler are influenced by outliers.
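
The contrast between the three scalers can be illustrated with a small standard-library sketch. It is not the chapter's implementation: the function names, the toy feature values, and the choice to center the robust scaling on the median (a common convention) are assumptions made for illustration only.

```python
import statistics


def minmax_scale(values):
    """Map values to [0, 1] using the minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range (Q3 - Q1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


def inlier_spread(scaled, n_inliers):
    """Range occupied by the first n_inliers samples after scaling."""
    head = scaled[:n_inliers]
    return max(head) - min(head)


# Hypothetical feature: five well-behaved readings plus one extreme outlier.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]

for name, scaler in [("minmax", minmax_scale),
                     ("standard", standard_scale),
                     ("robust", robust_scale)]:
    spread = inlier_spread(scaler(feature), n_inliers=5)
    print(f"{name:>8}: inlier spread after scaling = {spread:.3f}")
```

In this toy case the single outlier squeezes the five inliers into a narrow band under min-max and standard scaling, while the quartile-based scaling keeps them well separated.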


             4.3 Validation dataset

             We created four distinct validation datasets containing known real/synthetic
             outliers to assess and compare the performances of the four unsupervised ODTs
             studied in this chapter. Because these methods are unsupervised, there is no
             direct way of quantifying the performances of isolation forest, local outlier
             factor, DBSCAN, and one-class SVM. Therefore, we use expert knowledge,
             physically consistent thresholds, and various synthetic data creation methods
             to assign an outlier or inlier label to each sample in the validation dataset.
             Several samples in the labeled validation dataset are synthetic samples
             generated using physically consistent formulations. Each of the four
             validation datasets is processed by each of the four unsupervised ODTs;
             following that, the inliers and outliers detected by the unsupervised ODT are
             compared with the prespecified outlier/inlier labels assigned to the samples
             of the validation dataset by the human expert.
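
The comparison of detected labels against expert-assigned labels can be sketched as follows. This is a minimal illustration, not the chapter's evaluation code: the label encoding (1 for outlier, 0 for inlier), the function names, the example label lists, and the use of precision and recall as summary scores are all assumptions.

```python
def confusion_counts(expert_labels, detected_labels, outlier=1):
    """Count agreements and disagreements, treating `outlier` as the positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(expert_labels, detected_labels):
        if truth == outlier and pred == outlier:
            counts["tp"] += 1      # outlier correctly detected
        elif truth != outlier and pred == outlier:
            counts["fp"] += 1      # inlier wrongly flagged as outlier
        elif truth == outlier and pred != outlier:
            counts["fn"] += 1      # outlier missed by the detector
        else:
            counts["tn"] += 1      # inlier correctly left alone
    return counts


def precision_recall(counts):
    """Precision and recall for the outlier class (0.0 when undefined)."""
    flagged = counts["tp"] + counts["fp"]
    actual = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged if flagged else 0.0
    recall = counts["tp"] / actual if actual else 0.0
    return precision, recall


# Hypothetical labels for eight samples: 1 = outlier, 0 = inlier.
expert = [0, 0, 0, 1, 1, 0, 1, 0]
detected = [0, 0, 1, 1, 0, 0, 1, 0]

counts = confusion_counts(expert, detected)
precision, recall = precision_recall(counts)
print(counts, f"precision={precision:.2f}", f"recall={recall:.2f}")
```

If a detector encodes outliers with a different value (for example -1), the `outlier` argument can be set accordingly, provided the expert labels use the same encoding.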

             4.3.1 Dataset #1: Containing noisy measurements
             Dataset #1 was constructed from the previously mentioned onshore dataset to
             compare the performance of the four unsupervised outlier detection techniques
             in detecting depths where the log responses are adversely affected by noise.
             Noise in a well-log dataset can adversely affect its geological/geophysical
             interpretation because it masks the formation properties. The onshore dataset
             contains log responses measured at 5617 depths in Well 1, drilled with a bit
             of size 7.875″. Dataset #1 comprises gamma ray (GR), bulk density (RHOB),
             compressional sonic travel time (DTC), and deep resistivity (RT) logs from the
             onshore dataset for the depths where the borehole diameter is between 7.8″ and
             8.2″. This led to 4037 inliers in Dataset #1. Following that, synthetic noisy
             log responses for 200 additional depths were randomly introduced into Dataset
             #1. The noise samples were created such that they belong to the same
             distribution as the inlier data but are two standard deviations away from the
             mean of each feature in Dataset #1. Consequently, Dataset #1 contains a total
             of 4237 samples, of which 200 are point outliers. Comparative study on Dataset
             #1 involved experiments with