The DBSCAN model is very effective in detecting point outliers. It can detect collective outliers if they occur as low-density regions. It is not reliable for detecting contextual outliers. DBSCAN is not suitable when inliers are distributed as low-density regions, and it requires considerable expertise to select the optimal hyperparameters that control the outlier detection. For Fig. 1.1D, DBSCAN exhibits a reliable outlier-detection performance, unlike OCSVM.
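
A minimal sketch of DBSCAN-based outlier detection with scikit-learn is shown below; the array X, the appended synthetic outliers, and the eps and min_samples values are illustrative assumptions, not the settings used for Fig. 1.1.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# X is assumed to be an array of shape (n_samples, n_features)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(100, 2)),
               [[6.0, 6.0], [6.5, 5.5]]])      # two obvious point outliers

# Scaling matters because eps is a distance threshold in feature space
X_scaled = StandardScaler().fit_transform(X)

# eps and min_samples jointly control the density threshold for clustering
db = DBSCAN(eps=0.5, min_samples=5).fit(X_scaled)

# DBSCAN labels samples in low-density regions as noise (cluster label -1),
# which serves as the outlier flag in this unsupervised ODT
outlier_mask = db.labels_ == -1
print("Number of detected outliers:", outlier_mask.sum())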


             3.4 Local outlier factor
Local outlier factor (LOF) is an unsupervised ODT based on the relative density of a region. Simple density-based ODT methods are not reliable when the clusters are of varying densities; for example, inliers can be distributed as high-density and low-density regions, while outliers can be distributed as a high-density region. Local outlier factor mitigates this shortcoming of DBSCAN by using relative density as the measure to assign an outlier score to each sample. LOF compares the local density of a sample with the local densities of its k-nearest neighbors to identify outliers, which lie in regions that have a substantially lower density than their k-nearest neighbors. LOF assigns a score to each sample by computing the relative density of the sample as the ratio of the average local reachability density of its neighbors to the local reachability density of the sample, and it flags the points with high scores as outliers [10]. A sample with an LOF score of 3 means that the average density of its neighbors is about three times its local density; that is, the sample is not like its neighbors. An LOF score smaller than 1 indicates that the sample has a higher density than its neighbors. The number of neighbors (K) sets how many neighbors are considered when computing the LOF score for a sample.
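
A minimal sketch of LOF using scikit-learn's LocalOutlierFactor is given below; the array X, the n_neighbors (K) value, and the contamination setting are illustrative assumptions rather than values taken from the chapter.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# X is assumed to be an array of shape (n_samples, n_features)
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(size=(100, 2)),
               [[5.0, 5.0], [5.5, 4.5]]])      # two sparse, far-away samples

# n_neighbors corresponds to K in the text; contamination sets the cutoff
# used to convert LOF scores into outlier labels
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)                    # -1 for outliers, +1 for inliers

# scikit-learn stores the negated LOF score; negating it recovers the ratio
# described in the text, where values well above 1 indicate outliers
lof_scores = -lof.negative_outlier_factor_
print("Largest LOF scores:", np.sort(lof_scores)[-5:])
print("Detected outlier indices:", np.where(labels == -1)[0])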
In Fig. 1.1C, LOF is applied to the previously mentioned two-dimensional dataset containing 25 samples. Samples at the upper right corner of Fig. 1.1C are outliers, and the radius of the circle encompassing a sample is directly proportional to the LOF score of the sample. The LOF scores for three of the six points at the upper right corner are low; this is odd considering that, from visual inspection, those points are obviously outliers as well. We notice that exception because those points are closer to the high-density region and we set the number of neighbors considered when calculating relative density at K = 20. The density of the samples in the dense region reduces the LOF scores of the three points in the upper right-hand corner. For the dataset considered in Fig. 1.1, LOF performance can be improved by reducing the value of the hyperparameter K. Like DBSCAN for unsupervised ODT, LOF is severely affected by the curse of dimensionality and is computationally intensive when there is a large number of samples [11]. Moreover, LOF can be biased because the user selects a cutoff for the LOF scores to label the outliers; this threshold selection can be inconsistent and subjective. For a certain dataset, a score greater than 1.2 could represent an outlier, while in another case, the limit could be 1.8. LOF needs attentive tuning of its hyperparameters. Due to the local