Page 30 - Machine Learning for Subsurface Characterization
Unsupervised outlier detection techniques Chapter 1 15
the steps laid out in Fig. 1.3A. To ensure a controlled environment for our investigation, we created four distinct validation datasets comprising outlier/inlier
labels, which were assigned by a human expert. The ability of the unsupervised
ODTs to accurately detect the outliers and inliers is analyzed using various evaluation metrics. Note that real-world implementations of unsupervised ODTs are generally performed without any prior knowledge of outliers and inliers, following the steps laid out in Fig. 1.3B; consequently, there is no way to evaluate the unsupervised ODTs during real-world implementation and select the best one. Nonetheless, our comparative study helps identify the unsupervised ODT that performs best on various types of well-log datasets with minimal hyperparameter tuning.
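The validation workflow of Fig. 1.3A can be sketched as follows. This is a minimal illustrative example, not the chapter's actual setup: the synthetic data, the choice of isolation forest as the ODT, the contamination value, and the use of the F1 score are all assumptions made for demonstration. The expert labels are used only for evaluation, never for training, which is what keeps the ODT unsupervised.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)

# Synthetic stand-in for a labeled validation dataset:
# inliers form a tight cluster, outliers are scattered widely.
inliers = rng.normal(0.0, 1.0, size=(200, 2))
outliers = rng.uniform(-6.0, 6.0, size=(10, 2))
X = np.vstack([inliers, outliers])

# Expert-assigned labels (1 = inlier, -1 = outlier), used only for scoring
y_true = np.array([1] * 200 + [-1] * 10)

# Fit an unsupervised ODT; it never sees y_true
odt = IsolationForest(contamination=0.05, random_state=0).fit(X)
y_pred = odt.predict(X)  # returns 1 for inliers, -1 for outliers

# Evaluate against the expert labels
print("F1 (outlier class):", f1_score(y_true, y_pred, pos_label=-1))
```

The same pattern applies to any of the compared ODTs: fit on the unlabeled features, predict inlier/outlier flags, then score those flags against the expert labels.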
4.1 Description of the dataset used for the comparative study
of unsupervised ODTs
Log data used for this work were obtained from two wells in different reservoirs.
Gamma ray (GR), density (RHOB), neutron porosity (NPHI), compressional
sonic travel time (DTC), and deep and shallow resistivity logs (RT and
RXO) from Well 1 are available within the depth interval of 580–5186 ft, comprising 5617 depth samples; herein, this dataset will be referred to as the onshore
dataset. The onshore dataset contains log responses from different lithologies of
limestone, sandstone, dolomite, and shale. Gamma ray (GR), density (DEN),
neutron porosity (NEU), compressional sonic transit time (AC), deep and
medium resistivities (RDEP and RMED), and photoelectric factor (PEF) logs
from Well 2 are available within the depth interval of 8333–13327 ft, comprising 9986 depth samples; herein, this dataset will be referred to as the offshore dataset. The offshore dataset contains log responses from different lithologies of limestone, sandstone, dolomite, shale, and anhydrite.
4.2 Data preprocessing
Data preprocessing refers to the transformations applied to data before feeding
them to the machine learning algorithm [15]. The primary purpose of data preprocessing is to convert raw data into a clean dataset that the machine learning workflow can process. A few data preprocessing tasks include fixing null/NaN values, imputing missing values, scaling the features, normalizing samples, removing anomalies, encoding qualitative/nominal categorical features, and reformatting data. Data preprocessing is an important step because a data-driven model built using machine learning is only as good as the quality of the data processed by the model.
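The first few preprocessing tasks listed above can be sketched as follows. This is an illustrative example with made-up toy values; the column names and the sentinel value -999.25 (a common null flag in log files) are assumptions, and the chapter does not prescribe these specific tools.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy well-log table (values and column names are illustrative)
logs = pd.DataFrame({
    "GR":   [45.0, 130.0, -999.25, 78.0],  # -999.25 used as a null flag
    "RHOB": [2.45, 2.60, 2.30, np.nan],
})

# 1. Fix null flags: convert sentinel values to NaN
logs = logs.replace(-999.25, np.nan)

# 2. Impute missing values (median is robust to remaining anomalies)
imputed = SimpleImputer(strategy="median").fit_transform(logs)

# 3. Scale each feature to zero mean and unit variance
scaled = StandardScaler().fit_transform(imputed)

print(scaled.mean(axis=0))  # each column now has mean ~0
```

Removing anomalies and encoding categorical features (e.g., lithology names) would follow the same pattern of fitting a transformer on the raw columns and applying it before the ODT sees the data.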
4.2.1 Feature transformation: Convert R to log(R)
Machine learning models tend to perform better when the features/attributes are not skewed and have relatively similar distributions and variances. Resistivity