
4. Variability and Bias in Our Performance Estimates




                   [Figure 7.11 appears here: an ROC plot of Sensitivity (TPF) versus False Positive Fraction (FPF), with Specificity (TNF) along the top axis. Each reader's operating point is plotted, labeled by reader number, for two decisions: "Biopsy patient?" and "Any immediate work up?"]
                  FIGURE 7.11
                  Sensitivities and specificities of 10 readers making two types of decisions. The wide line is
                  a parametric model of their average ROC curve.
                  based on one set of data used to train and test the CI. If someone else were to attempt
                  to reproduce our results, they would probably use different training and testing data,
                  and they would get a performance measure different from ours. Is this
                  number consistent with our measure of CI performance? Or is their performance
                  significantly different? To make statements of statistical significance, we need
                  to know the statistical uncertainty for these performance measures. Our uncertainty
                   estimates should reflect how much variation there would be if others tried to reproduce
                   our results with new data. If we properly calculate our uncertainties, the results
                  from about 83% of the reproductions of the study will fall within our calculated
                  95% confidence interval.
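
To see where the 83% figure comes from, here is a minimal Monte Carlo sketch. It is not from the chapter: the true performance value, its standard error, and the assumption of independent Gaussian errors of equal size for the original study and each reproduction are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the true performance is 0.85 (say, an AUC), and both the
# original study and each reproduction estimate it with the same standard error.
true_value, std_err, n_reps = 0.85, 0.02, 100_000

# Original study: point estimates and their 95% confidence intervals.
original = rng.normal(true_value, std_err, n_reps)
ci_lo = original - 1.96 * std_err
ci_hi = original + 1.96 * std_err

# Independent reproductions of the study with new data.
reproduction = rng.normal(true_value, std_err, n_reps)

# Fraction of reproductions whose estimate lands inside the original 95% CI.
covered = np.mean((reproduction >= ci_lo) & (reproduction <= ci_hi))
print(f"Reproductions inside the original 95% CI: {covered:.1%}")  # roughly 83%
```

Under these assumptions the difference between the original and reproduced estimates has a standard deviation sqrt(2) times the standard error, so the expected capture rate is 2*Phi(1.96/sqrt(2)) - 1, about 0.83, which the simulation reproduces.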
                      Generally we do not have the data, time, or money to reproduce the study repeatedly,
                   but we may simulate the reproduction of the experiment and use the variation in
                  our performance statistic in that simulation as an estimate of our uncertainty. While
                  we could create simulated data and use that to generate a new CI version and test
                  it, our simulated data might not accurately reflect the true correlations in real data.
                  Indeed if we knew what all the correlations were, we could write them down and
                  create an ideal discriminator. Therefore, we often create a “simulation” by resampling
                  the existing data. This idea is illustrated in Fig. 7.12. Methods of resampling data
                  include cross-validation, the jackknife, and the bootstrap [19].
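
As a concrete sketch of how resampling can stand in for reproducing the experiment, the example below uses one of these methods, the bootstrap (described in the next paragraph). It is not from the chapter: the test labels, the classifier scores, and the choice of AUC as the performance statistic are stand-ins chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical test set: true class labels and the trained classifier's scores.
labels = rng.integers(0, 2, size=200)
scores = np.where(labels == 1,
                  rng.normal(1.0, 1.0, size=200),   # positive cases tend to score higher
                  rng.normal(0.0, 1.0, size=200))

def auc(y, s):
    """Area under the ROC curve via the Mann-Whitney statistic."""
    pos, neg = s[y == 1], s[y == 0]
    diff = pos[:, None] - neg[None, :]
    # Fraction of (positive, negative) pairs ranked correctly; ties count one half.
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

n = len(labels)
n_boot = 2000
boot_aucs = np.empty(n_boot)
for b in range(n_boot):
    # Sample N cases with replacement; on average each resample omits
    # about 1/e (roughly 37%) of the original cases.
    idx = rng.integers(0, n, size=n)
    boot_aucs[b] = auc(labels[idx], scores[idx])

print(f"AUC on the original sample: {auc(labels, scores):.3f}")
print(f"Bootstrap standard error:   {boot_aucs.std(ddof=1):.3f}")
print(f"95% interval: ({np.percentile(boot_aucs, 2.5):.3f}, "
      f"{np.percentile(boot_aucs, 97.5):.3f})")
```

The spread of the resampled AUCs plays the role of the variation we would expect across reproductions of the study; the percentile interval shown is one common way to turn that spread into a confidence interval.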
                     In the bootstrap we sample our original data with replacement to create another
                  sample of the same size [20]. If there were N cases in our original dataset, the
                  resampled set would also have N cases, but approximately 37% of them would be