Page 26 - Machine Learning for Subsurface Characterization
Unsupervised outlier detection techniques Chapter 1 11
The DBSCAN model is very effective in detecting point outliers. It can detect
collective outliers if they occur as low-density regions. It is not reliable for
detecting contextual outliers. DBSCAN is not suitable when inliers are
distributed as low-density regions, and it requires considerable expertise to
select the hyperparameters that control the outlier detection. For Fig. 1.1D,
DBSCAN exhibits reliable outlier-detection performance, unlike OCSVM.
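
To illustrate how the DBSCAN hyperparameters control which samples are labeled
as outliers, the following is a minimal, brute-force sketch in pure Python. It
is not the implementation used to produce Fig. 1.1; the function name `dbscan`,
the parameter names `eps` (neighborhood radius) and `min_samples` (minimum
neighborhood size for a core point), and the toy data are assumptions chosen
for illustration.

```python
import math

NOISE = -1  # label given to samples that belong to no cluster (outliers)


def dbscan(points, eps, min_samples):
    """Return a cluster id (0, 1, ...) or NOISE for each point.

    Hypothetical helper for illustration: brute-force neighbor search
    with Euclidean distance, suitable only for small datasets.
    """
    n = len(points)

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_samples:
            # Not a core point; may be relabeled later as a border point.
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            neighbors = region(j)
            if len(neighbors) >= min_samples:
                queue.extend(neighbors)  # core point: keep growing the cluster
    return labels


# Toy data (assumed): five close samples and one isolated sample.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
toy_labels = dbscan(toy_points, eps=1.1, min_samples=3)
toy_outliers = [p for p, lab in zip(toy_points, toy_labels) if lab == NOISE]
```

In this sketch, shrinking `eps` or raising `min_samples` makes it harder for a
sample to qualify as a core point, so more samples end up labeled as noise;
this sensitivity is why the hyperparameter selection needs care.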
3.4 Local outlier factor
Local outlier factor (LOF) is an unsupervised ODT based on the relative density
of a region. Simple density-based ODT methods are not as reliable for outlier
detection when the clusters are of varying densities; for example, inliers can
be distributed as both high-density and low-density regions, and outliers can
be distributed as a high-density region. LOF mitigates these challenges with
DBSCAN by using relative density as the measure to assign an outlier score to
each sample. LOF compares the local density of a sample with the local
densities of its k-nearest neighbors to identify outliers, which lie in regions
that have a substantially lower density than their k-nearest neighbors. LOF
assigns a score to each sample by computing the relative density of the sample
as the ratio of the average local reachability density of its neighbors to the
local reachability density of the sample, and flags the samples with high
scores as outliers [10]. A sample with an LOF score of 3 has neighbors whose
average density is about three times its own local density; that is, the sample
is not like its neighbors. An LOF score close to 1 indicates that the sample
has a density similar to that of its neighbors, and an LOF score smaller than 1
indicates that the sample has a higher density than its neighbors. The number
of neighbors (K) sets how many neighbors are considered when computing the LOF
score for a sample.
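
The ratio described above can be sketched in a few lines of pure Python. This
is a minimal, brute-force illustration under stated assumptions, not the code
used for Fig. 1.1: the function name `lof_scores`, the use of Euclidean
distance, and the toy data are all assumptions, and the sketch assumes the
dataset contains no duplicate points.

```python
import math


def lof_scores(points, k):
    """Return one LOF score per point (hypothetical helper, brute force).

    Assumes Euclidean distance and no duplicate points; duplicates could
    make a mean reachability distance zero and cause a division by zero.
    """
    n = len(points)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")

    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]  # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # Keep every point within the k-distance (ties are included).
        neighbors.append([j for j in order if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k-distance of j, distance between i and j).
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbors divided by the sample's density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data (assumed): five close samples and one isolated sample.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
toy_scores = lof_scores(toy_points, k=3)
```

With these toy points, the isolated sample receives a score far above 1, while
the five clustered samples score close to 1, which matches the interpretation
of the ratio given above.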
In Fig. 1.1C, LOF is applied to the previously mentioned two-dimensional
dataset containing 25 samples. The samples at the upper right corner of
Fig. 1.1C are outliers, and the radius of the circle encompassing a sample is
directly proportional to the LOF score of the sample. The LOF score for three
of the six points at the upper right corner is low, which is odd because visual
inspection shows that those points are outliers as well. This exception arises
because those three points are closer to the high-density region and because
the number of neighbors considered when calculating relative density was set to
K = 20. The density of the samples in the dense region reduces the LOF scores
of those three points. For the dataset considered in Fig. 1.1, LOF performance
can be improved by reducing the value of the hyperparameter K. Like DBSCAN for
unsupervised ODT, LOF is severely
affected by the curse of dimensionality and is computationally intensive when
there are a large number of samples [11]. Moreover, LOF results can be biased
because the user selects the cutoff on the LOF scores that labels the outliers,
and that choice can be inconsistent from one dataset to another. For one
dataset, a score greater than 1.2 could represent an outlier, while for
another, the limit could be 1.8. LOF therefore needs attentive tuning of its
hyperparameters. Due to the local