Page 20 - Machine Learning for Subsurface Characterization
Unsupervised outlier detection techniques Chapter 1 5
features. These outliers lie in the tails of a distribution and deviate substantially
from its mean, generally lying more than 2 standard deviations away
from it; for example, subsurface depths where porosity is >40 porosity
units or permeability is >5 Darcy should be considered point/global outliers.
From an event perspective, a house getting hit by a meteorite is an example of a
point outlier. The second category of outliers is contextual/conditional outliers,
which deviate significantly from the data points within a specific context,
for example, a large gamma ray reading in sandstone due to an increase in
potassium-rich minerals (feldspar). Snow in summer is an example of a contextual
outlier. Points labeled as contextual outliers are valid outliers only for a specific
context; a change in the context will result in a similar point being considered an
inlier. Collective outliers are small clusters of data that as a whole deviate
significantly from the entire dataset, for example, log measurements from regions
affected by borehole washout. As an everyday analogy, it is not rare for people to move from
one residence to the next; however, when an entire neighborhood relocates at the
same time, the move is considered a collective outlier. Contextual and collective
outliers need a domain expert to guide the outlier detection.
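The difference between a point/global outlier and a contextual outlier can be illustrated with a minimal sketch. It assumes the 2-standard-deviation rule of thumb mentioned above and uses lithology as the context. The function names, the gamma ray values, and the lithology labels are hypothetical and synthetic, chosen only for illustration; they are not taken from any real well log.

```python
from statistics import mean, stdev

def point_outliers(values, n_std=2.0):
    """Indices of values more than n_std sample standard deviations from the mean."""
    mu, sd = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > n_std * sd]

def contextual_outliers(values, contexts, n_std=2.0):
    """Apply the same rule separately within each context label."""
    flagged = []
    for ctx in set(contexts):
        idx = [i for i, c in enumerate(contexts) if c == ctx]
        local = point_outliers([values[i] for i in idx], n_std)
        flagged.extend(idx[j] for j in local)
    return sorted(flagged)

# Synthetic gamma ray readings (API units): 10 sandstone samples, then 10 shale samples.
# The last sandstone sample (index 9) reads 120, which is ordinary for shale.
gamma_ray = [30, 35, 32, 38, 34, 36, 33, 31, 37, 120,
             115, 120, 125, 118, 122, 117, 123, 119, 121, 120]
lithology = ["sandstone"] * 10 + ["shale"] * 10

print(point_outliers(gamma_ray))                   # nothing flagged over the whole dataset
print(contextual_outliers(gamma_ray, lithology))   # index 9 flagged within sandstone
```

In this sketch the reading of 120 is unremarkable when all samples are pooled, but it stands out once the comparison is restricted to sandstone. Choosing the context (here, lithology) is the step that requires domain knowledge.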
2 Outlier detection techniques
An outlier detection technique (ODT) is used to detect anomalous observations/samples
that do not fit the typical/normal statistical distribution of a dataset.
Simple methods for outlier detection apply statistical tools, such as the boxplot and
Z-score, to each individual feature of the dataset. A boxplot is a standardized
way of representing the distributions of samples corresponding to various features
using boxes and whiskers. The boxes represent the interquartile range of
the data, and the whiskers extend beyond the first and third quartiles by a multiple
(commonly 1.5) of the interquartile range; any data point/sample outside these limits
is considered an outlier. The next simple statistical tool for feature-specific outlier detection is the
Z-score, which indicates how far the value of the data point/sample is from its
mean for a specific feature. A Z-score of 1 means the sample point is 1 standard
deviation away from its mean. Typically, Z-score values greater than +3 or less
than -3 are considered outliers. The Z-score is expressed as
$$\text{Z-score} = \frac{x_i - \bar{x}}{\sigma} \tag{1.1}$$
where $\sigma$ and $\bar{x}$ are the standard deviation and mean of the distribution of feature x,
respectively, and $x_i$ is the value of the feature x for the ith sample.
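The two feature-specific tools described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the Z-score uses the population standard deviation, the boxplot limits are placed 1.5 interquartile ranges beyond the first and third quartiles (a common convention), and the quartiles are computed with the "inclusive" method of the standard library. The function names and the porosity values are hypothetical and synthetic.

```python
from statistics import mean, pstdev, quantiles

def z_scores(values):
    """Z-score of each sample, as in Eq. (1.1): (x_i - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]

def zscore_outliers(values, threshold=3.0):
    """Indices of samples whose Z-score is greater than +threshold or less than -threshold."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > threshold]

def boxplot_outliers(values, k=1.5):
    """Indices of samples outside [Q1 - k*IQR, Q3 + k*IQR]; k = 1.5 is an assumed convention."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Synthetic porosity values (porosity units); the last sample is deliberately extreme.
porosity = [18, 20, 22, 19, 21, 23, 20, 18, 22, 21,
            19, 20, 24, 21, 20, 22, 19, 23, 21, 45]

print(zscore_outliers(porosity))   # index 19 (Z-score of about 4.2)
print(boxplot_outliers(porosity))  # index 19 (above the upper limit of 25.375)
```

Both rules operate on one feature at a time. In small samples the Z-score of a single extreme value is bounded by the sample size, so the ±3 threshold can only be exceeded when enough samples are available.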
Outlier detection based on simple statistical tools generally assumes that the
features have normal distributions while neglecting the correlation between features
in a multivariate dataset. Advanced outlier detection methods based on
machine learning (ML) can handle correlated multivariate datasets, detect abnormalities
within them, and do not assume normal distributions of the features.
Well logs and subsurface measurements sense heterogeneous geological
mixtures with considerable complexity in the distributions of minerals and