Page 40 - Machine Learning for Subsurface Characterization

Unsupervised outlier detection techniques Chapter  1 25


















             FIG. 1.6 Examples of (A) an ROC curve and (B) a PR curve of a classifier, for demonstration
             purposes. The area under each blue curve represents the AUC.


             exhibiting a gradient close to 1 and ROC-AUC of 0.5 (red-dotted line (light
             gray in the print version) in Fig. 1.6A; referred to as the no-skill line)
             indicates that the unsupervised ODT is performing only as well as randomly
             selecting certain samples as outliers. A high ROC-AUC score close to 1
             indicates that a large portion of the actual outliers and inliers will be
             correctly detected without much sensitivity to the decision thresholds of the
             method. DBSCAN is not designed for supervised tasks; therefore, it has no
             built-in functionality to generate the ROC curve and the ROC-AUC score. The
             ROC curve should be used when the numbers of outliers and inliers are nearly
             equal, without any major imbalance in the dataset. ROC curves for various
             unsupervised methods on Dataset #1 are shown in Figs. 1.D1–1.D3 in Appendix D.
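The threshold sweep behind the ROC curve can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the chapter: the function names, the toy labels, and the toy scores are all hypothetical, and it assumes the ODT returns a continuous outlier score in which a higher value means a more anomalous sample.

```python
def roc_curve_points(labels, scores):
    """Return (fpr, tpr) lists obtained by sweeping the decision threshold.

    labels: 1 for a true outlier, 0 for a true inlier (known ground truth).
    scores: outlier scores; higher is assumed to mean more anomalous.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Rank samples from most to least anomalous.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample in a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def trapezoid_auc(x, y):
    """Area under a piecewise-linear curve given by points (x[k], y[k])."""
    return sum((x[k] - x[k - 1]) * (y[k] + y[k - 1]) / 2.0
               for k in range(1, len(x)))


# Hypothetical toy data: 3 true outliers among 8 samples.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

fpr, tpr = roc_curve_points(labels, scores)
roc_auc = trapezoid_auc(fpr, tpr)

# A detector that gives every sample the same score cannot rank anything,
# so its ROC curve collapses onto the no-skill diagonal.
flat_fpr, flat_tpr = roc_curve_points(labels, [0.5] * len(labels))
no_skill_auc = trapezoid_auc(flat_fpr, flat_tpr)

print(f"ROC-AUC = {roc_auc:.3f}, no-skill ROC-AUC = {no_skill_auc:.3f}")
```

A method that returns only hard labels, as DBSCAN does, yields a single operating point rather than a curve, which is why a continuous score is assumed here.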

             4.4.7 Precision-recall (PR) curve and PR-AUC score
             Similar to the ROC curve, the precision-recall (PR) curve plots the precision
             against the recall of the unsupervised ODT on a dataset for various decision
             thresholds. Performance of the unsupervised ODT is considered to be excellent
             when the detection has high precision and high recall irrespective of the
             choice of decision threshold, that is, the PR curve shifts toward the top right
             corner in the plot, away from the red-dotted line (light gray in the print
             version) shown in Fig. 1.6B. The red-dotted line is referred to as the no-skill
             line; it is a horizontal line at a precision equal to the total number of
             original positives (i.e., true outliers) divided by the total number of
             original outliers and original inliers (i.e., the total number of samples).
             The performance shown in Fig. 1.6B is rather poor. The AUC of the PR curve is
             used as a measure of the ODT model performance, with an AUC of 1 indicating
             robust and reliable outlier detection. A good PR curve should lie away from
             the no-skill line and toward the top right corner in the plot shown in
             Fig. 1.6B. The PR curve should be used when the numbers of outliers and
             inliers are very different, resulting in significant imbalance in the dataset.
             PR curves for various unsupervised methods on Dataset #1 are shown in
             Figs. 1.D1–1.D3 in Appendix D.
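The PR curve and its no-skill line can be sketched the same way. Again, this is only an illustration under stated assumptions: the function names and toy data are hypothetical, higher scores are assumed to mean more anomalous samples, and the area is computed as a step-wise sum (one common convention, often called average precision) rather than by whatever integration rule was used for the figures in this chapter.

```python
def pr_curve_points(labels, scores):
    """Return (precision, recall) lists obtained by sweeping the threshold.

    labels: 1 for a true outlier, 0 for a true inlier (known ground truth).
    scores: outlier scores; higher is assumed to mean more anomalous.
    """
    n_pos = sum(labels)
    # Rank samples from most to least anomalous.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    precision, recall = [], []
    tp = fp = 0
    for rank, i in enumerate(order):
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample in a group of tied scores.
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            precision.append(tp / (tp + fp))
            recall.append(tp / n_pos)
    return precision, recall


def step_pr_auc(precision, recall):
    """Step-wise area under the PR curve: sum of (recall gain) * precision."""
    area, prev_recall = 0.0, 0.0
    for p, r in zip(precision, recall):
        area += (r - prev_recall) * p
        prev_recall = r
    return area


def no_skill_precision(labels):
    """Height of the no-skill line: true outliers / total samples."""
    return sum(labels) / len(labels)


# Hypothetical toy data: 3 true outliers among 8 samples.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

precision, recall = pr_curve_points(labels, scores)
pr_auc = step_pr_auc(precision, recall)
baseline = no_skill_precision(labels)

print(f"PR-AUC = {pr_auc:.3f}, no-skill precision = {baseline:.3f}")
```

Unlike the ROC no-skill line, which is fixed at an AUC of 0.5, the PR baseline depends on the outlier fraction, so a PR-AUC should always be read against the baseline of the dataset at hand.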