Page 168 - Machine Learning for Subsurface Characterization
centers are shifted until the distortion/inertia metric converges, that is, until
further iterations no longer shift the cluster centers appreciably.
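The convergence behavior described above can be sketched with scikit-learn; the synthetic blob data here stands in for the well-log features and is not the dataset used in the chapter:

```python
# Minimal sketch: K-means inertia (within-cluster sum of squared
# distances) plateaus once the centers stop shifting.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Same initialization each run (n_init=1, fixed seed), increasing
# iteration budget: Lloyd's algorithm monotonically reduces inertia.
inertias = []
for max_iter in (1, 2, 5, 20):
    km = KMeans(n_clusters=3, n_init=1, max_iter=max_iter,
                random_state=0).fit(X)
    inertias.append(km.inertia_)

print(inertias)  # non-increasing sequence; later values nearly identical
```

The plateau in the printed sequence is exactly the "centers do not shift a lot" stopping condition.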
Gaussian mixture model (GMM) assumes the clusters in the dataset are
generated from a mixture of Gaussian distributions. The data points in the
multidimensional feature space are fitted to multivariate normal distributions
whose parameters maximize the likelihood of the data; each point is then
assigned to the component with the highest posterior probability.
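A minimal GMM sketch with scikit-learn, again on synthetic data standing in for the well-log features:

```python
# Minimal sketch: EM fits one multivariate normal per component; each
# point is assigned to the component with the highest posterior
# responsibility.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type='full',
                      random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignment per point
resp = gmm.predict_proba(X)    # posterior responsibilities per component

print(np.unique(labels))       # the three fitted component labels
```

Unlike K-means, the responsibilities in `resp` give a soft, probabilistic cluster membership for every point.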
Hierarchical clustering clusters the dataset by repeatedly merging
(agglomerative) or splitting (divisive) data based on certain similarities to
generate a hierarchy of clusters. For example, agglomerative hierarchical
clustering using Euclidean distance as a measure of similarity repeatedly
executes the following two steps: (1) Identify the two clusters that are closest
to each other, and (2) merge the two closest clusters with an assumption that
the proximity of clusters indicates similarity of the clusters. This continues
until all the clusters are merged together. DBSCAN clusters the data points
based on the density of the data points. The algorithm groups data points
that have many neighbors into the same cluster and flags points with few
neighbors as outliers. DBSCAN requires the user to define the minimum
number of points required to form a cluster and the maximum distance
(neighborhood radius) within which two points are considered part of the
same cluster.
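The agglomerative merging and the DBSCAN density rule described above can both be sketched with scikit-learn; the `eps` and `min_samples` values are illustrative, not tuned for the well-log data:

```python
# Minimal sketch of agglomerative clustering and DBSCAN on synthetic data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6,
                  random_state=0)

# Agglomerative: repeatedly merge the two closest clusters (Euclidean,
# single linkage) until the requested number of clusters remains.
agg = AgglomerativeClustering(n_clusters=3, linkage='single').fit(X)

# DBSCAN: min_samples is the minimum neighborhood size needed to form a
# cluster; eps is the neighborhood radius. Sparse points get label -1.
db = DBSCAN(eps=0.8, min_samples=5).fit(X)
n_outliers = int(np.sum(db.labels_ == -1))
```

Points labeled `-1` by DBSCAN are the low-density outliers the text refers to; agglomerative clustering, by contrast, assigns every point to some cluster.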
The fifth clustering technique, SOM, utilizes a neural network for unsupervised
dimensionality reduction by projecting the high-dimensional data onto two-
dimensional space while maintaining the original similarity between the data
points. Here, we first apply SOM projection, then use K-means to cluster the
dimensionality-reduced data in the lower-dimensional feature space into groups.
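The two-stage SOM-then-K-means workflow can be sketched as follows. Since the chapter does not name a SOM library, this sketch implements a small SOM from scratch in NumPy; the grid size, decay schedules, and synthetic data are all illustrative assumptions:

```python
# Minimal sketch: project high-dimensional data onto a 2-D SOM grid,
# then run K-means on the projected coordinates.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=300, n_features=5, centers=3, random_state=0)

# Small 10x10 SOM: each grid node holds a weight vector that is pulled
# toward the samples mapped near it on the grid.
rows, cols = 10, 10
weights = rng.normal(size=(rows * cols, X.shape[1]))
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)

for t in range(1000):
    x = X[rng.integers(len(X))]
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
    sigma = 3.0 * np.exp(-t / 500)                     # shrinking neighborhood
    lr = 0.5 * np.exp(-t / 500)                        # decaying learning rate
    d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
    h = np.exp(-d2 / (2 * sigma ** 2))                 # neighborhood weights
    weights += lr * h[:, None] * (x - weights)

# Project each sample to the 2-D grid coordinate of its BMU, then
# cluster the projected points with K-means, as in the text.
bmus = np.argmin(((X[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2),
                 axis=1)
projected = grid[bmus]
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(projected)
```

The K-means step operates only on the 2-D `projected` coordinates, so nearby grid positions (i.e., similar original samples) end up in the same cluster.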
We first applied the five clustering techniques on all the “easy-to-acquire”
logs (features). The clusters so obtained did not exhibit any correlation with
the performances of the shallow-learning regression models for the synthesis
of DTS and DTC logs. Clustering methods that use Euclidean distance, for
example, K-means and DBSCAN, perform poorly in high-dimensional feature
space due to the curse of dimensionality. High dimensionality and high
nonlinearity when using all the 13 “easy-to-acquire” logs resulted in complex
relationships among the features that were challenging for the clustering
algorithms to resolve into reliable clusters. In order to avoid the curse of
dimensionality, only three “easy-to-acquire” logs, namely, DPHZ, NPOR, and
RHOZ, were used for the desired clustering because these logs exhibit good
correlations with the log synthesis performance of the shallow-learning models
(Fig. 5.3). We chose these three logs to build the clusters for determining the
reliability of log synthesis using the shallow-learning models in new wells. For
the five clustering techniques, we processed the three selected features,
namely, DPHZ, NPOR, and RHOZ, to generate only three clusters that could
show potential correlation with the good, intermediate, and bad log-synthesis
performances, respectively, of the shallow-learning models. The 4240-ft
formation in Well 1 is clustered into three clusters by processing the DPHZ,
NPOR, and RHOZ logs. Following that, the averaged cluster numbers of each
50-ft depth interval and the averaged relative errors in log synthesis for each