Here, n is the number of depth points in the dataset for which DD log synthesis needs to be performed; j = 1, 2, 3, or 4 indicates the four conductivity-dispersion logs, and j = 5, 6, 7, or 8 indicates the four permittivity-dispersion logs; s indicates a synthesized log response, and m indicates a measured log response; D_{s,ij} is the conductivity σ or relative permittivity ε_r log response synthesized for depth i; and D_{m,ij} is the σ or ε_r log response measured at depth i. NRMSE for log j is then expressed as
$$\mathrm{NRMSE}_j = \frac{\mathrm{RMSE}_j}{D_{m,j,\max} - D_{m,j,\min}} \tag{4.2}$$
where subscripts min and max indicate the minimum and maximum values of log j, such that j = 1, 2, 3, or 4 indicates the four conductivity-dispersion logs and j = 5, 6, 7, or 8 indicates the four permittivity-dispersion logs. In our study, high prediction accuracy is indicated by an NRMSE of less than 0.1. When using NRMSE with the range as the denominator, it is crucial to remove outliers from the dataset.
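As an illustration, the following is a minimal sketch (not the authors' implementation) of computing the per-log RMSE and the range-normalized NRMSE of Eq. (4.2) with NumPy. The array names D_m and D_s are hypothetical: rows are the n depth points and columns are the eight dispersion logs, holding the measured and synthesized responses, respectively.

```python
import numpy as np

def per_log_nrmse(D_m, D_s):
    """RMSE_j and range-normalized NRMSE_j for each of the 8 dispersion logs.

    D_m, D_s : arrays of shape (n, 8) holding the measured and synthesized
    log responses at n depth points (outliers removed beforehand).
    """
    residuals = D_s - D_m                                # synthesized minus measured
    rmse = np.sqrt(np.mean(residuals ** 2, axis=0))      # one RMSE per log j
    log_range = D_m.max(axis=0) - D_m.min(axis=0)        # D_{m,j,max} - D_{m,j,min}
    return rmse, rmse / log_range

# Synthetic example: 500 depth points, 8 logs, small synthesis error
rng = np.random.default_rng(seed=0)
D_m = rng.normal(size=(500, 8))
D_s = D_m + 0.05 * rng.normal(size=(500, 8))
rmse, nrmse = per_log_nrmse(D_m, D_s)
print(nrmse < 0.1)   # True for logs that meet the 0.1 high-accuracy threshold
```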
When a model generates the targets, the model performance can be represented as an error/residual distribution by compiling the errors for all the samples in the dataset. A single-valued evaluation metric, like R², MAE, RMSE, or NRMSE, condenses the error distribution into a single number and ignores much of the information about model performance contained in the error distribution. A single-valued metric provides only one projection of the model errors and, therefore, emphasizes only a certain aspect of the error characteristics. When evaluating different models, it is important to consider the error distributions instead of relying on a single metric. Statistical features of the error distribution, such as mean, variance, skewness, and flatness (kurtosis), are needed along with a combination of single-valued metrics to fully assess the model performances. In addition, we should monitor the heteroscedasticity of the residuals/errors (i.e., differences in the scatter of the residuals over different ranges of values of a feature). The existence of heteroscedasticity can invalidate statistical tests of significance of the model.
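To make this concrete, here is a minimal sketch, assuming NumPy and SciPy, of summarizing the residual distribution and screening for heteroscedasticity by comparing the residual scatter across quantile bins of a feature; the variable names (errors, feature) are hypothetical.

```python
import numpy as np
from scipy import stats

def residual_summary(errors):
    """Mean, variance, skewness, and flatness (excess kurtosis) of the errors."""
    return {
        "mean": np.mean(errors),
        "variance": np.var(errors),
        "skewness": stats.skew(errors),
        "kurtosis": stats.kurtosis(errors),   # 0 for a Gaussian error distribution
    }

def residual_spread_by_feature(errors, feature, n_bins=5):
    """Standard deviation of the residuals within quantile bins of a feature.

    Strongly unequal spreads across the bins suggest heteroscedastic residuals.
    """
    edges = np.quantile(feature, np.linspace(0.0, 1.0, n_bins + 1))
    bin_idx = np.digitize(feature, edges[1:-1])           # bin labels 0 .. n_bins-1
    return np.array([errors[bin_idx == b].std() for b in range(n_bins)])
```

The binning heuristic above is only a quick visual/numerical check; a formal test such as the Breusch-Pagan test (available, e.g., as het_breuschpagan in statsmodels) could be used instead.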
2.4 Data preprocessing
In supervised learning, the dataset is divided into three parts for the purposes of model training, testing, and validation. There should not be any common samples among the three splits. The testing dataset should be treated like a new dataset and should never be used during model training. Validation data help reduce overfitting, but setting aside validation data reduces the size of the training and testing datasets. In this chapter, the dataset is split into two parts, namely, training and testing datasets. Instead of using a validation dataset, we use a regularization term in the loss function to reduce overfitting. As a result, more data are available for the training and testing stages, which is beneficial for developing data-driven models under the constraints of data quantity [10].
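As a hedged illustration of this two-way split, the sketch below (assuming scikit-learn, with a hypothetical feature matrix X and target y, and ridge regression standing in for the chapter's model) splits the data into training and testing subsets only and relies on an L2 regularization term in the loss function, rather than a held-out validation set, to limit overfitting.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(400, 10))                              # e.g., input logs (hypothetical)
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=400)    # target log to be synthesized

# Two-way split: training and testing subsets share no common samples,
# and the testing subset is never used during model training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = Ridge(alpha=1.0)       # alpha sets the strength of the regularization term
model.fit(X_train, y_train)
print("Test R^2:", model.score(X_test, y_test))
```

The regularization strength (alpha here) remains a hyperparameter; it can be tuned by cross-validation on the training subset alone so that the testing subset stays untouched.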