Page 35 - Machine Learning for Subsurface Characterization
4.3.4 Dataset #4: Containing manually labeled outliers
The offshore dataset acquired in Well 2 contains seven log responses from
different lithologies: limestone, sandstone, dolomite, shale, and anhydrite. The
seven logs are gamma ray (GR), density (DEN), neutron porosity (NEU), com-
pressional sonic transit time (AC), deep and medium resistivity (RDEP and
RMED), and photoelectric factor (PEF) logs. The offshore dataset was labeled
using manual inspection, feature thresholding, and DBSCAN, followed by manual
verification of the labels (outlier vs inlier), to create Dataset #4 for the
purpose of validating the four unsupervised ODTs. Construction of Dataset
#4 required an expert to closely examine the log responses along the entire
length of Well 2 to manually assign outlier labels to certain depths exhibiting
anomalous log responses. Manual labels (outlier vs inlier) were assigned after
analyzing variance of each log and three-dimensional distributions of logs
acquired in Well 2. A few outliers were identified by first viewing the histogram
and boxplot for each feature (i.e., log) and then defining the thresholds for each
feature. The thresholds were determined based on common industry standards
for determining when the logs are outside their normal ranges. We implemented
the following feature thresholds to determine outliers based on the
one-dimensional distribution of a log: (1) density correction (DENC) log
>0.12 g/cc, (2) photoelectric factor (PEF) log >8 B/E, and (3) gamma ray (GR)
log >350 gAPI.
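The three thresholds above can be applied as a simple mask over the depth samples. The following is a minimal sketch, not the authors' implementation: the dictionary keys, the function name threshold_outliers, and the sample values are hypothetical and chosen only for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Thresholds quoted in the text; the keys are hypothetical column names.
THRESHOLDS = {"DENC": 0.12, "PEF": 8.0, "GR": 350.0}


def threshold_outliers(logs, thresholds=THRESHOLDS):
    """Return a boolean mask that is True where any log exceeds its threshold."""
    n_depths = len(next(iter(logs.values())))
    mask = np.zeros(n_depths, dtype=bool)
    for name, limit in thresholds.items():
        # A depth is flagged as soon as one of its logs is above the limit.
        mask |= np.asarray(logs[name], dtype=float) > limit
    return mask


# Synthetic values for illustration only (not data from Well 2).
logs = {
    "DENC": [0.02, 0.15, 0.01, 0.05],    # g/cc
    "PEF": [3.1, 4.0, 9.2, 5.0],         # B/E
    "GR": [80.0, 120.0, 60.0, 350.0],    # gAPI
}
mask = threshold_outliers(logs)
```

In this sketch the comparison is strict, so a reading exactly equal to a threshold (the last GR value) is not flagged, matching the ">" in the text.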
The seven logs from the offshore dataset were also analyzed using
three-dimensional scatter plots to detect outliers based on the
three-dimensional distribution of each combination of three logs, one at a
time. Seven available logs in the offshore dataset will have 35 ($^{7}C_{3}$)
possible combinations of three logs.
Out of the 35 combinations, 7 combinations were analyzed to manually label the
outliers. DBSCAN was used sequentially on each combination of three logs,
one combination at a time, to identify the isolated points and clusters that do
not belong to the dense cluster of normal data. DBSCAN was used as a cluster-
ing technique to identify noise points and clusters that were labeled as outliers
because of their location in the low-density region of the feature space. When
creating Dataset #4, DBSCAN was used as a clustering technique and not as
an unsupervised ODT. Dataset #4 was designed as a validation set to compare the
performance of three out of the four unsupervised ODTs, namely isolation
forest, OCSVM, and LOF. The seventh combination of three logs contributed
few additional outliers to the dataset, indicating that most outliers in
three-dimensional space had already been identified using the six prior
combinations out of the total 35 possible combinations.
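The labeling step described above can be sketched as follows, assuming scikit-learn's DBSCAN (whose eps and min_samples arguments match the hyperparameter names used in this section). The function name dbscan_outliers, the synthetic points, and the chosen eps and min_samples values are illustrative assumptions, not the settings used for Well 2. Every point outside the largest cluster, whether a noise point or a member of a smaller cluster, is treated as an outlier.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import DBSCAN

LOG_NAMES = ["GR", "DEN", "NEU", "AC", "RDEP", "RMED", "PEF"]
# All ways of choosing three of the seven logs (35 in total).
triples = list(combinations(LOG_NAMES, 3))


def dbscan_outliers(X, eps, min_samples):
    """Flag every point that is not in the largest DBSCAN cluster."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clustered = labels[labels >= 0]          # label -1 marks noise points
    if clustered.size == 0:
        return np.ones(len(labels), dtype=bool)
    main_cluster = np.bincount(clustered).argmax()
    return labels != main_cluster


# Synthetic three-feature data for illustration only.
g = np.arange(5) * 0.1
dense = np.array([(x, y, z) for x in g for y in g for z in g])   # 125 points
small = np.array([10.0, 0.0, 0.0]) + np.array(
    [[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [0.1, 0.1, 0], [0, 0, 0.1]]
)                                                                # small far cluster
isolated = np.array([[5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])         # isolated points
X = np.vstack([dense, small, isolated])

is_outlier = dbscan_outliers(X, eps=0.15, min_samples=4)
```

The sketch runs DBSCAN on raw feature values. Logs measured in different units would generally need to be rescaled before a single eps is meaningful; the text does not state how this was handled, so any scaling step would be a further assumption.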
DBSCAN is well suited to datasets of low dimensionality. DBSCAN
has two primary hyperparameters, namely, min_samples and eps, that control
the detection of outliers. The two hyperparameters of DBSCAN clustering were
tuned for each combination of three logs through visual analysis of the outliers
being detected on the scatter plot. This process was continued until the normal
cluster of data was identified as inliers and all other points were identified as
outliers. Three of the seven scatterplots used are shown in Fig. 1.5. The blue