Page 24 - Machine Learning for Subsurface Characterization

Unsupervised outlier detection techniques (Chapter 1)
FIG. 1.1 Performances of the four unsupervised outlier detection techniques, namely, (A) isolation forest with hyperparameters: max_samples = 10, n_estimators = 100, max_features = 2, and contamination = "auto"; (B) one-class SVM with hyperparameters: nu = 0.5 and gamma = 0.04; (C) local outlier factor with hyperparameters: n_neighbors = 20, metric = "minkowski", and p = 2; and (D) DBSCAN with hyperparameters: eps = 0.5, min_samples = 5, metric = "minkowski", and p = 2, on the synthetic two-dimensional dataset containing 25 samples. Red samples (light gray in the print version) indicate outliers, and blue samples (dark gray in the print version) indicate inliers. All other hyperparameters except those mentioned earlier have default values.
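The four configurations listed in the caption can be reproduced with scikit-learn. The sketch below is a minimal illustration: the 25-sample dataset here is a hypothetical stand-in (20 clustered inliers plus 5 scattered points), not the actual synthetic dataset used for Fig. 1.1, and the detectors use only the hyperparameters stated in the caption, with all others left at their defaults.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(42)
# Hypothetical stand-in for the 25-sample synthetic 2D dataset:
# 20 inliers near the origin plus 5 widely scattered samples.
X = np.vstack([
    rng.normal(0, 1, size=(20, 2)),
    rng.uniform(-6, 6, size=(5, 2)),
])

# (A) Isolation forest; fit_predict returns -1 for outliers, +1 for inliers
iso = IsolationForest(max_samples=10, n_estimators=100,
                      max_features=2, contamination="auto")
labels_iso = iso.fit_predict(X)

# (B) One-class SVM
ocsvm = OneClassSVM(nu=0.5, gamma=0.04)
labels_svm = ocsvm.fit_predict(X)

# (C) Local outlier factor
lof = LocalOutlierFactor(n_neighbors=20, metric="minkowski", p=2)
labels_lof = lof.fit_predict(X)

# (D) DBSCAN: samples labeled -1 (noise) are treated as outliers,
# all clustered samples as inliers
db = DBSCAN(eps=0.5, min_samples=5, metric="minkowski", p=2)
labels_db = np.where(db.fit_predict(X) == -1, -1, 1)
```

Note that DBSCAN is a clustering algorithm rather than a dedicated outlier detector; mapping its noise label (-1) to "outlier" is the conventional adaptation used here.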
few outliers (minimally contaminated). OCSVM builds a representational model of normality (inliers) by processing the dataset, wherein most of the samples are assumed to be inliers. OCSVM is based on the support vector machine, which finds the support vectors and then separates the data into classes using hyperplanes/hyperspheres. OCSVM finds a minimal hypersphere in the kernel space (transformed feature space) that circumscribes the maximum number of inliers (normal samples); this hypersphere defines normality in the dataset. Equivalently, OCSVM nonlinearly projects the data into a high-dimensional kernel space and then maximally separates the data from the origin of the kernel space by finding an optimal hyperplane. As a result, OCSVM may be viewed as a regular two-class SVM in which most of the training data (i.e., the inliers) lie in the first class, and the origin is taken as the dominant member of the second class containing the outliers. Nonetheless, there is a trade-off between maximizing the distance of the hyperplane from the origin and the number of training data points contained in the hypersphere (region separated from the origin by the