Page 20 - Machine Learning for Subsurface Characterization
P. 20

Unsupervised outlier detection techniques Chapter  1 5


             features. These outliers exist at the tail end of a distribution and largely vary from
             the mean of the distribution, generally lying beyond 2 standard deviations away
             from the mean; for example, subsurface depths where porosity is >40 porosity
             units or permeability is >5 Darcy should be considered as point/global outliers.
             From an event perspective, a house getting hit by a meteorite is an example of
             point outlier. The second category of outliers is the contextual/conditional out-
             liers, which deviate significantly from the data points within a specific context,
             for example, a large gamma ray reading in sandstone due to an increase in
             potassium-rich minerals (feldspar). Snow in summer is an example of contextual
             outlier. Points labeled as contextual outliers are valid outliers only for a specific
             context; a change in the context will result in a similar point to be considered as an
             inlier. Collective outliers are a small cluster of data that as a whole deviate sig-
             nificantly from the entire dataset, for example, log measurements from regions
             affected by borehole washout. For example, it is not rare that people move from
             one residence to the next; however, when an entire neighborhood relocates at the
             same time, it will be considered as collective outlier. Contextual and collective
             outliers need a domain expert to guide the outlier detection.


             2  Outlier detection techniques
             An outlier detection technique (ODT) is used to detect anomalous observations/
             samples that do not fit the typical/normal statistical distribution of a dataset.
             Simple methods for outlier detection use statistical tools, such as boxplot and
             Z-score, on each individual feature of the dataset. A boxplot is a standardized
             way of representing the distributions of samples corresponding to various fea-
             tures using boxes and whiskers. The boxes represent the interquartile range of
             the data, and the whiskers represent a multiple of the first and third quartiles of
             the variable; any data point/sample outside these limits is considered as an out-
             lier. The next simple statistical tool for feature-specific outlier detection is the
             Z-score, which indicates how far the value of the data point/sample is from its
             mean for a specific feature. A Z-score of 1 means the sample point is 1 standard
             deviation away from its mean. Typically, Z-score values greater than or less
             than +3 or  3, respectively, are considered outliers. Z-score is expressed as
                                               x i  x
                                     Z  score ¼                         (1.1)
                                                 σ
             where σ and x is the standard deviation and mean of the distribution of feature x,
             respectively, and x i is the value of the feature x for the ith sample.
                Outlier detection based on simple statistical tools generally assume that the
             features have normal distributions while neglecting the correlation between fea-
             tures in a multivariate dataset. Advanced outlier detection method based on
             machine learning (ML) can handle correlated multivariate dataset, detect abnor-
             malities within them, and do not assume a normal distributions of the features.
             Well logs and subsurface measurements are sensing heterogeneous geological
             mixtures with a lot of complexity in terms of the distributions of minerals and
   15   16   17   18   19   20   21   22   23   24   25