Page 18 - Machine Learning for Subsurface Characterization

Unsupervised outlier detection techniques Chapter 1 3

unsupervised outlier detection techniques (ODTs) on various original and
synthetic well-log datasets.

1.1 Basic terminologies in machine learning and data-driven models
Before discussing outliers further, the authors would like to clearly distinguish
the following terms: dataset, sample, feature, and target. Data-driven
(DD) and machine learning-based (ML) methods find statistical/probabilistic
functions by processing a relevant dataset to either relate features to targets
(referred to as supervised learning) or appropriately transform features and/or
samples (referred to as unsupervised learning). A dataset is a collection of
various types of information (i.e., values of features and/or targets) about
several samples.
Features are physical properties or attributes that can be measured or computed
for each sample in the dataset. Targets are the observable/measurable outcomes,
and the target values for a sample are consequences of certain combinations of
features for that sample. For purposes of unsupervised learning, a relevant dataset
is a collection of only the features for all the available samples, whereas for
purposes of supervised learning, a dataset is a collection of features and
corresponding targets for all the available samples. Such a dataset comprises one or many targets and
several features for several samples. An increase in the number of samples
increases the size of the dataset, whereas an increase in the number of features
increases the dimensionality of the dataset. A DD/ML model generally becomes more robust
with an increase in the size of the dataset. However, with an increase in the dimensionality
of the dataset, a model tends to overfit and becomes less generalizable, unless the
increase in dimension is due to the addition of informative, relevant, uncorrelated
features. Prior to building the DD/ML model using supervised learning, a dataset
is split into training and testing datasets to ensure the model does not overfit the
training dataset and generalizes well to the testing dataset. Further, the training
dataset is divided into a certain number of splits (folds) to perform cross-validation that
ensures the model learns from and is evaluated on all the statistical distributions
present in the training dataset. For evaluating the model on the testing dataset, it is
of utmost importance to avoid any form of mixing (leakage) between the training
and testing datasets. Also, when evaluating the model on the testing dataset, one
should select relevant evaluation metrics from the several available metrics, each of
which has its own assumptions and limitations.
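
The splitting workflow described above can be illustrated with a minimal sketch. It assumes a toy dataset of 10 samples and 3 features filled with random numbers, a 20% testing fraction, and 4 cross-validation folds; these values, and the helper names `train_test_split_indices` and `k_fold_indices`, are hypothetical choices made for illustration and are not taken from the text.

```python
import random

# Toy dataset (hypothetical values): each row is one sample, each column one feature.
# 10 samples -> size of the dataset; 3 features -> dimensionality of the dataset.
random.seed(0)
features = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(10)]
# One target per sample; targets are needed only for supervised learning.
targets = [sum(row) for row in features]


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into disjoint train/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, n_folds=4):
    """Yield (fit, validate) index lists; each training sample is validated once."""
    folds = [train_indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        validate = folds[k]
        fit = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield fit, validate


train_idx, test_idx = train_test_split_indices(len(features))
# No leakage: a sample index must never appear in both sets.
assert set(train_idx).isdisjoint(test_idx)

for fit_idx, val_idx in k_fold_indices(train_idx):
    # Cross-validation draws only on training samples; the testing set stays untouched.
    assert set(fit_idx).isdisjoint(val_idx)
    assert set(fit_idx) | set(val_idx) == set(train_idx)
```

The sketch splits by sample index only. In practice, any preprocessing statistic (for example, a feature mean used for scaling) should also be computed from the training samples alone, since computing it on the full dataset is another form of leakage. Libraries such as scikit-learn provide `train_test_split` and `KFold` for the same purpose.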

1.2 Types of machine learning techniques
Machine learning (ML) techniques can be broadly categorized into three types:
supervised learning, unsupervised learning, and reinforcement learning.
In supervised learning (e.g., regression and classification), a data-driven model
is developed by first training the model on samples with known features/attributes
and corresponding targets/outcomes from the training dataset; following
that, the trained model is evaluated on the testing dataset; and finally, the data-
