Page 21 - Machine Learning for Subsurface Characterization
fluids; consequently, these measurements do not necessarily exhibit a
Gaussian distribution and generally exhibit considerable correlations among
the features. Data-driven outlier detection techniques built using machine
learning are more robust in detecting outliers as compared with simple
statistical tools.
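
As a small illustration of the point about correlated, non-Gaussian features, the sketch below compares a per-feature z-score rule with a data-driven detector. Everything in it is an assumption made for illustration and is not taken from the text: the synthetic two-feature dataset, the |z| > 3 rule standing in for a "simple statistical tool," and scikit-learn's `LocalOutlierFactor` with `n_neighbors=20` standing in for a machine-learning technique.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Two strongly correlated features, spread uniformly (non-Gaussian) along a trend.
x = np.linspace(-3.0, 3.0, 200)
y = x + rng.normal(0.0, 0.05, size=x.size)
X = np.column_stack([x, y])

# Append one sample that breaks the x-y trend but lies inside each feature's range.
X = np.vstack([X, [1.5, -1.5]])

# Simple statistical tool (assumed here): flag a sample if |z-score| > 3 in any feature.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
zscore_flags = (z > 3.0).any(axis=1)

# Data-driven detector (assumed here): local outlier factor, which compares the
# local density around each sample with the density around its neighbors.
lof = LocalOutlierFactor(n_neighbors=20)
lof_flags = lof.fit_predict(X) == -1  # fit_predict returns -1 for outliers

print("z-score rule flags the off-trend sample:", bool(zscore_flags[-1]))
print("LOF flags the off-trend sample:", bool(lof_flags[-1]))
```

In this construction the off-trend sample is unremarkable in each feature taken separately, so a per-feature rule has little to act on, whereas a density-based detector can use the joint structure of the two features.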
Outliers in a dataset can be detected using either supervised or unsupervised
ML techniques. In supervised ODT, outlier detection is treated as a classification
problem. The outlier-detection model is trained on a dataset with samples
prelabeled as either normal data (inliers) or outliers. The trained model then
assigns labels to the samples in a new, unseen, unlabeled dataset as either
inliers or outliers based on what was learned from the training dataset. Supervised
ODT is robust when the model is exposed to a large, statistically diverse
training set (i.e., a dataset that contains every possible instance of normal/inlier
and outlier samples) whose samples are accurately labeled as normal/inlier or
outlier. Unfortunately, such a training set is difficult, time-consuming, and sometimes
impossible to obtain because labeling requires significant human expertise and
building a large dataset requires expensive data acquisition. In contrast,
unsupervised ODT does not require a labeled dataset. Unsupervised
ODTs generally assume the following: (1) the number of outliers is much
smaller than the number of normal samples, and (2) outliers do not follow the overall
“trend” in the dataset. Popular outlier detection techniques are listed
in Appendix A.
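
The two approaches described above can be sketched side by side as follows. This is a minimal sketch under stated assumptions rather than a recommended workflow: the synthetic data, the hypothetical helper `make_dataset`, the choice of scikit-learn's `RandomForestClassifier` as the supervised classifier, the choice of `IsolationForest` as the unsupervised detector, and the `contamination=0.05` setting (an assumed outlier fraction) are all introduced here for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)

def make_dataset(n_inliers, n_outliers):
    """Hypothetical helper: dense inlier cloud plus a few sparse, off-trend samples."""
    inliers = rng.normal(0.0, 1.0, size=(n_inliers, 2))
    # Outliers are scattered on a ring far away from the inlier cloud.
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n_outliers)
    radius = rng.uniform(8.0, 12.0, size=n_outliers)
    outliers = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
    X = np.vstack([inliers, outliers])
    labels = np.r_[np.zeros(n_inliers, dtype=int), np.ones(n_outliers, dtype=int)]
    return X, labels  # label 1 = outlier, label 0 = inlier

X_train, y_train = make_dataset(500, 25)  # outliers are a small minority
X_new, y_new = make_dataset(200, 10)      # y_new is used only for scoring

# Supervised ODT: a classifier trained on prelabeled inliers and outliers.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
supervised_pred = clf.predict(X_new)

# Unsupervised ODT: the labels are never shown to the model; it relies on
# outliers being few and not following the overall trend of the data.
iso = IsolationForest(contamination=0.05, random_state=0)
iso.fit(X_train)
unsupervised_pred = (iso.predict(X_new) == -1).astype(int)  # -1 means outlier

print("supervised accuracy:", (supervised_pred == y_new).mean())
print("unsupervised outlier recall:", unsupervised_pred[y_new == 1].mean())
```

The supervised model depends entirely on `y_train`, which is the labeling cost discussed above, while the unsupervised model depends on the two stated assumptions holding for the data at hand.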
Both supervised and unsupervised ODTs are used in various industries. For
instance, in credit fraud detection, neural networks are trained on all known
fraudulent and legitimate transactions, and every new transaction is assigned
a fraudulent or legitimate label by the model based on the information from
the training dataset. A model could also be trained in an unsupervised manner to flag
transactions that are dissimilar from what is normally encountered. In medical
diagnosis, ODTs are used in the early detection and diagnosis of certain
diseases by analyzing patient data (e.g., blood pressure, heart rate, and insulin
level) to find patients whose measurements deviate significantly from
normal conditions. Zengyou et al. [2] used a cluster-based local outlier factor
algorithm to detect malignant breast cancer by training their model on features
related to breast cancer. ODTs are also used to detect irregularities in
heart function by analyzing measurements from an electrocardiogram
(ECG) for early diagnosis of certain heart diseases. In the oil and
gas industry, Chaudhary et al. [3] were able to improve the performance of the
stretched exponential production decline (SEPD) model by detecting and
removing outliers from production data by using the local outlier factor method.
In another oil and gas application, Luis et al. [4] used a one-class support vector
machine (OCSVM) to detect possible operational issues in offshore turbomachinery,
such as pumps and compressors, by detecting anomalous signals from
their sensors. When implementing an unsupervised ODT, prior knowledge of
the expected fraction of outliers improves the accuracy of outlier detection. In