such that $n$ is the number of depth points in the dataset for which DD log synthesis needs to be performed; $j = 1, 2, 3,$ or $4$ indicates the four conductivity-dispersion logs, and $j = 5, 6, 7,$ or $8$ indicates the four permittivity-dispersion logs; $s$ indicates a synthesized log response and $m$ indicates a measured log response; $D_{s,ij}$ is the conductivity $\sigma$ or relative permittivity $\varepsilon_r$ log response synthesized for depth $i$; and $D_{m,ij}$ is the $\sigma$ or $\varepsilon_r$ log response measured at depth $i$. NRMSE for log $j$ is then expressed as

$$\mathrm{NRMSE}_j = \frac{\mathrm{RMSE}_j}{D_{m,j,\max} - D_{m,j,\min}} \tag{4.2}$$
where the subscripts min and max indicate the minimum and maximum values of log $j$, such that $j = 1, 2, 3,$ or $4$ indicates the four conductivity-dispersion logs and $j = 5, 6, 7,$ or $8$ indicates the four permittivity-dispersion logs. In our study, high prediction accuracy is indicated by an NRMSE of less than 0.1. When using NRMSE with the range as the denominator, it is crucial to remove outliers from the dataset, because the range is highly sensitive to extreme values.
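
Under these definitions, per-log NRMSE reduces to a few lines of array arithmetic. The sketch below is a minimal illustration, assuming the measured and synthesized responses sit in NumPy arrays D_m and D_s of shape (n, 8) (hypothetical names and layout) and that outliers have already been removed:

```python
import numpy as np

def nrmse_per_log(D_m, D_s):
    """Range-normalized RMSE, Eq. (4.2), computed column-wise per log."""
    rmse = np.sqrt(np.mean((D_s - D_m) ** 2, axis=0))  # RMSE_j over n depths
    log_range = D_m.max(axis=0) - D_m.min(axis=0)      # D_{m,j,max} - D_{m,j,min}
    return rmse / log_range                            # NRMSE_j

# Synthetic stand-in: n = 200 depths, 8 logs (columns 0-3: conductivity
# dispersion; columns 4-7: permittivity dispersion)
rng = np.random.default_rng(42)
D_m = rng.normal(loc=5.0, size=(200, 8))
D_s = D_m + 0.05 * rng.normal(size=(200, 8))  # synthesized ~ measured
print(nrmse_per_log(D_m, D_s))                # values well below 0.1 here
```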
   When a model generates the targets, the model performance can be represented as an error/residual distribution by compiling the errors for all the samples in the dataset. A single-valued evaluation metric, like $R^2$, MAE, RMSE, or NRMSE, condenses the error distribution into a single number and discards much of the information about model performance present in the error distribution. A single-valued metric provides only one projection of the model errors and, therefore, emphasizes only certain aspects of the error characteristics. When evaluating different models, it is important to consider the error distributions instead of relying on a single metric. Statistical features of the error distribution, such as mean, variance, skewness, and flatness, are needed along with a combination of single-valued metrics to fully assess the model performances. In addition, we should monitor the heteroscedasticity of the residuals/errors (i.e., differences in the scatter of the residuals across different ranges of feature values). The existence of heteroscedasticity can invalidate statistical tests of the significance of the model.
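
To make this concrete, the following sketch (an illustration with synthetic residuals, not the chapter's data) compiles the residuals for one log, computes the distributional features named above with SciPy, and runs a crude heteroscedasticity check by comparing the residual scatter across terciles of the measured values:

```python
import numpy as np
from scipy import stats

# Synthetic residuals whose scatter grows with the measured value,
# i.e., deliberately heteroscedastic errors
rng = np.random.default_rng(0)
measured = rng.uniform(0.0, 1.0, size=300)
residuals = rng.normal(scale=0.02 + 0.10 * measured)

# Statistical features of the error distribution
features = {
    "mean":     residuals.mean(),
    "variance": residuals.var(ddof=1),
    "skewness": stats.skew(residuals),
    "flatness": stats.kurtosis(residuals),  # excess kurtosis
}
print(features)

# Heteroscedasticity check: residual spread within low/mid/high terciles
# of the measured values; strongly unequal spreads flag heteroscedasticity
edges = np.quantile(measured, [1 / 3, 2 / 3])
tercile = np.digitize(measured, edges)
print([residuals[tercile == k].std(ddof=1) for k in range(3)])
```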



2.4 Data preprocessing
In supervised learning, the dataset is divided into three parts for the purposes of model training, testing, and validation. There should not be any common samples among the three splits. The testing dataset should be treated like a new dataset and should never be used during model training. Validation data help reduce overfitting, but they also reduce the sizes of the training and testing datasets. In this chapter, the dataset is split into only two parts, namely, training and testing datasets. Instead of using a validation dataset, we use a regularization term in the loss function to reduce overfitting. As a result, more data are available for the training and testing stages, which is beneficial for developing data-driven models under the constraints of data quantity [10].
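
A minimal sketch of this two-way workflow follows, using scikit-learn and a ridge regression as a stand-in for the chapter's model (the estimator, feature count, and regularization strength alpha are illustrative assumptions, not the authors' setup):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge

# Synthetic stand-in: 500 depth samples, 10 feature logs, one target log
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=500)

# Two-way split: no common samples between training and testing, and the
# testing set is used only once, for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The L2 penalty alpha * ||w||^2 in the ridge loss takes over the
# overfitting control that a validation set would otherwise provide
model = Ridge(alpha=1.0).fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out testing data
```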