Page 35 - Machine Learning for Subsurface Characterization


            4.3.4 Dataset #4: Containing manually labeled outliers
            The offshore dataset acquired in Well 2 contains seven log responses from
            different lithologies: limestone, sandstone, dolomite, shale, and anhydrite. The
            seven logs are gamma ray (GR), density (DEN), neutron porosity (NEU), com-
            pressional sonic transit time (AC), deep and medium resistivity (RDEP and
            RMED), and photoelectric factor (PEF) logs. The offshore dataset was labeled
            using manual inspection, feature thresholding, and DBSCAN, followed by manual
            verification of the labels (outlier vs. inlier), to create Dataset #4 for
            validating the four unsupervised ODTs. Construction of Dataset
            #4 required an expert to closely examine the log responses along the entire
            length of Well 2 to manually assign outlier labels to certain depths exhibiting
            anomalous log responses. Manual labels (outlier vs. inlier) were assigned after
            analyzing the variance of each log and the three-dimensional distributions of logs
            acquired in Well 2. A few outliers were identified by first viewing the histogram
            and boxplot for each feature (i.e., log) and then defining a threshold for each
            feature. The thresholds were based on common industry standards for determining
            when the logs are outside their normal ranges. We implemented the
            following feature thresholds to flag outliers based on the one-dimensional
            distribution of a log: (1) density correction (DENC) log >0.12 g/cc, (2) photoelectric
            factor (PEF) log >8 B/E, and (3) gamma ray (GR) log >350 gAPI.
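The three one-dimensional thresholds above can be sketched in a few lines of Python. This is only an illustration: the function name, the record layout, and the sample readings are hypothetical, and combining the three rules with a logical OR (a depth is flagged if any one log exceeds its limit) is an assumption, since the text does not state how the rules were combined.

```python
# Thresholds quoted in the text: DENC in g/cc, PEF in B/E, GR in gAPI.
THRESHOLDS = {"DENC": 0.12, "PEF": 8.0, "GR": 350.0}


def flag_threshold_outliers(samples, thresholds=THRESHOLDS):
    """Return one boolean per depth sample.

    A sample is flagged when ANY thresholded log exceeds its limit
    (the OR combination is an assumption, not stated in the text).
    """
    return [
        any(sample[name] > limit for name, limit in thresholds.items())
        for sample in samples
    ]


# Hypothetical readings at three depths (illustrative values, not from Well 2).
samples = [
    {"DENC": 0.02, "PEF": 3.1, "GR": 45.0},   # all logs within range
    {"DENC": 0.15, "PEF": 4.0, "GR": 80.0},   # DENC above 0.12 g/cc
    {"DENC": 0.01, "PEF": 5.2, "GR": 410.0},  # GR above 350 gAPI
]

print(flag_threshold_outliers(samples))  # [False, True, True]
```

The comparisons are strict, matching the ">" in the text, so a reading exactly at a limit is not flagged.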
               The seven logs from the offshore dataset were also analyzed using three-dimensional
            scatter plots to detect outliers based on the three-dimensional distribution
            of each combination of three logs, one combination at a time. Seven available logs
            in the offshore dataset have 35 ($^{7}C_{3}$) possible combinations of three logs.
            Out of the 35 combinations, 7 combinations were analyzed to manually label the
            outliers. DBSCAN was applied to each combination of three logs, one
            combination at a time, to identify the isolated points and clusters that do
            not belong to the dense cluster of normal data. These noise points and clusters
            were labeled as outliers because of their location in the low-density region of
            the feature space. When creating Dataset #4, DBSCAN was used as a clustering
            technique and not as an unsupervised ODT. Dataset #4 was designed as a validation
            set to compare the performance of three of the four unsupervised ODTs, namely
            isolation forest, OCSVM, and LOF. The seventh combination of three logs added
            few outliers to the dataset, indicating that most outliers in three-dimensional
            space had already been identified using the six prior combinations
            out of the total 35 possible combinations.
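The labeling procedure described above can be sketched with scikit-learn's `DBSCAN`. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper `label_outliers`, the standard scaling step, the `eps` and `min_samples` values, the choice of the first seven combinations, and the synthetic data are all hypothetical. Every point outside the largest DBSCAN cluster is treated as an outlier, which mirrors the idea of flagging isolated points and clusters outside the dense cluster of normal data.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

LOGS = ["GR", "DEN", "NEU", "AC", "RDEP", "RMED", "PEF"]


def label_outliers(X, triples, eps=0.5, min_samples=10):
    """Union of DBSCAN-based outlier flags over several three-log combinations.

    X is an (n_samples, 7) array whose columns follow LOGS.
    """
    is_outlier = np.zeros(X.shape[0], dtype=bool)
    for triple in triples:
        cols = [LOGS.index(name) for name in triple]
        # Scaling is an assumption: it lets one eps apply to logs in different units.
        Z = StandardScaler().fit_transform(X[:, cols])
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(Z)
        clustered = labels[labels >= 0]  # DBSCAN marks noise points with -1
        if clustered.size == 0:
            is_outlier[:] = True  # no dense cluster was found at all
            continue
        main = np.bincount(clustered).argmax()  # largest cluster = normal data
        is_outlier |= labels != main  # noise and minor clusters become outliers
    return is_outlier


all_triples = list(combinations(LOGS, 3))
print(len(all_triples))  # 35

# Synthetic stand-in for the seven logs (not data from Well 2).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 7))
X[:5] += 12.0  # five isolated samples far from the dense cluster

# Which seven combinations were actually analyzed is not stated; the first seven are arbitrary.
flags = label_outliers(X, all_triples[:7])
print(flags[:5])
```

In practice `eps` and `min_samples` would be tuned separately for each combination by inspecting the scatter plot, rather than being fixed once as they are here.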
               DBSCAN is well suited to datasets of low dimensionality. DBSCAN
            has two primary hyperparameters, namely, min_samples and eps, that control
            the detection of outliers. The two hyperparameters of DBSCAN clustering were
            tuned for each combination of three logs through visual analysis of the outliers
            being detected on the scatter plot. This process was continued until the normal
            cluster of data was identified as inliers and all other points were identified as
            outliers. Three of the seven scatter plots used are shown in Fig. 1.5. The blue