Page 18 - Machine Learning for Subsurface Characterization

Unsupervised outlier detection techniques Chapter  1 3


             unsupervised outlier detection techniques (ODTs) on various original and syn-
             thetic well-log datasets.


             1.1 Basic terminologies in machine learning and data-driven models
             Before discussing outliers further, the authors would like to clearly
             distinguish the following terms: dataset, sample, feature, and target. Data-driven
             (DD) and machine learning-based (ML) methods find statistical/probabilistic
             functions by processing a relevant dataset to either relate features to targets
             (referred to as supervised learning) or appropriately transform features and/or
             samples (referred to as unsupervised learning). A dataset is a collection of
             values corresponding to features and/or targets for several samples.
             Features are physical properties or attributes that can be measured or computed
             for each sample in the dataset. Targets are the observable/measurable outcomes,
             and the target values for a sample are consequences of certain combinations of
             features for that sample. For unsupervised learning, a relevant dataset is a
             collection of only the features for all the available samples, whereas for
             supervised learning, a relevant dataset is a collection of features and the
             corresponding targets for all the available samples; in the latter case, the
             dataset comprises one or many targets and several features for several samples.
             An increase in the number of samples increases the size of the dataset, whereas
             an increase in the number of features increases the dimensionality of the
             dataset. A DD/ML model generally becomes more robust as the size of the dataset
             increases. However, as the dimensionality of the dataset increases, a model
             tends to overfit and becomes less generalizable, unless the increase in
             dimensionality is due to the addition of informative, relevant, uncorrelated
             features. Prior to building the DD/ML model using supervised learning, a dataset
             is split into training and testing datasets so that one can check that the model
             does not overfit the training dataset and generalizes well to the testing dataset.
             Further, the training dataset is divided into a certain number of splits (folds)
             to perform cross-validation, which ensures the model learns from and is evaluated
             on all the statistical distributions present in the training dataset. When
             evaluating the model on the testing dataset, it is of utmost importance to avoid
             any form of mixing (leakage) between the training and testing datasets. One
             should also select relevant evaluation metrics from the several available
             metrics, each of which has its own assumptions and limitations.
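
             The terminology above can be illustrated with a minimal sketch, assuming a
             small made-up dataset stored as plain Python lists. The helper names
             train_test_split_indices and k_fold_indices, the toy values, and the chosen
             test fraction and number of folds are all hypothetical and are not taken from
             the text. The sketch shows the size and dimensionality of a dataset, a
             train/test split with no shared samples (no leakage), and the division of the
             training dataset into folds for cross-validation.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle the sample indices once and cut them into disjoint train/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n_samples * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(train_indices, n_splits=4):
    """Yield (fit, validation) index lists built only from the training samples.

    Every training sample lands in exactly one validation fold.
    """
    folds = [train_indices[i::n_splits] for i in range(n_splits)]
    for k in range(n_splits):
        validation = folds[k]
        fit = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield fit, validation


# Toy dataset (illustrative values only): 10 samples, 3 features, 1 target each.
features = [[float(s + f) for f in range(3)] for s in range(10)]
targets = [sum(row) for row in features]

n_samples = len(features)      # size of the dataset (number of samples)
n_features = len(features[0])  # dimensionality of the dataset (number of features)

# Hold out testing samples first; cross-validation then uses training samples only,
# so no testing sample can leak into model fitting or fold-level evaluation.
train_idx, test_idx = train_test_split_indices(n_samples, test_fraction=0.2, seed=0)
folds = list(k_fold_indices(train_idx, n_splits=4))

for fold_number, (fit_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: fit on {len(fit_idx)} samples, "
          f"validate on {len(val_idx)} samples")
print(f"held-out testing samples: {len(test_idx)}")
```

             The testing indices are set aside before the folds are built, which is one
             simple way to keep the training and testing datasets from mixing. For
             unsupervised learning, the same sketch would apply with the targets list
             simply left unused.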


             1.2 Types of machine learning techniques
             Machine learning (ML) techniques can be broadly categorized into three
             types: supervised learning, unsupervised learning, and reinforcement learning.
             In supervised learning (e.g., regression and classification), a data-driven model
             is developed by first training the model on samples with known features/attributes
             and corresponding targets/outcomes from the training dataset; following
             that, the trained model is evaluated on the testing dataset; and finally, the data-