[ROC plot: Sensitivity (TPF) versus False Positive Fraction (FPF), with Specificity (TNF) as the top axis; each point is labeled by reader number, with separate markers for the "Biopsy patient?" and "Any immediate work up?" decisions.]
FIGURE 7.11
Sensitivities and specificities of 10 readers making two types of decisions. The wide line is
a parametric model of their average ROC curve.
based on one set of data used to train and test the CI. If someone else were to attempt
to reproduce our results, they would probably use different training and testing data,
and they would get a performance measure different from ours. Is their
number consistent with our measure of CI performance, or is their performance
significantly different? To make statements of statistical significance, we need
to know the statistical uncertainty for these performance measures. Our uncertainty
estimates should reflect how much variation there would be if others tried to repro-
duce our results with new data. If we properly calculate our uncertainties, the results
from about 83% of the reproductions of the study will fall within our calculated
95% confidence interval.
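That 83% figure may look like a typo, but it follows from a standard calculation. As a quick sketch, assuming the original estimate and a replicated estimate are independent and normally distributed about the same true value with the same standard error \(\sigma\), the replication lands inside the original 95% interval of half-width \(1.96\sigma\) with probability
\[
\Pr\!\left(\,\lvert \hat\theta_{\text{rep}} - \hat\theta_{\text{orig}} \rvert \le 1.96\,\sigma \right)
= 2\,\Phi\!\left(\frac{1.96}{\sqrt{2}}\right) - 1 \approx 0.83,
\]
since the difference \(\hat\theta_{\text{rep}} - \hat\theta_{\text{orig}}\) has variance \(2\sigma^{2}\). Here \(\Phi\) is the standard normal cumulative distribution function, and \(\hat\theta_{\text{orig}}\), \(\hat\theta_{\text{rep}}\) are illustrative symbols rather than notation used elsewhere in the chapter.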
Generally we do not have the data, time, or money to reproduce the study repeat-
edly, but we may simulate the reproduction of the experiment and use the variation in
our performance statistic in that simulation as an estimate of our uncertainty. While
we could create simulated data and use that to generate a new CI version and test
it, our simulated data might not accurately reflect the true correlations in real data.
Indeed, if we knew what all the correlations were, we could write them down and
create an ideal discriminator. Therefore, we often create a “simulation” by resampling
the existing data. This idea is illustrated in Fig. 7.12. Methods of resampling data
include cross-validation, the jackknife, and the bootstrap [19].
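As a concrete illustration of resampling, the short sketch below applies the bootstrap (described in the next paragraph) to a hypothetical test set: it redraws the cases with replacement many times, recomputes a performance statistic on each redraw, and uses the spread of those values as an uncertainty estimate. The data, the accuracy figure of merit, and every name in it are illustrative stand-ins, not the chapter's CI or its data.

import numpy as np

rng = np.random.default_rng(12345)

# Hypothetical stand-in data: decision scores from an already-trained
# classifier on N test cases, plus the true class labels.
N = 100
labels = np.repeat([0, 1], N // 2)
scores = rng.normal(loc=labels.astype(float), scale=1.0)

def figure_of_merit(s, y, threshold=0.5):
    # Illustrative performance statistic: accuracy at a fixed threshold.
    return np.mean((s > threshold) == y)

# Bootstrap: draw N case indices with replacement and recompute the statistic
# on each resampled set of the same size N.  (On average, roughly 37% of the
# original cases do not appear in any given resample.)
n_boot = 2000
boot_stats = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, N, size=N)
    boot_stats[b] = figure_of_merit(scores[idx], labels[idx])

estimate = figure_of_merit(scores, labels)
std_err = boot_stats.std(ddof=1)                 # bootstrap standard error
lo, hi = np.percentile(boot_stats, [2.5, 97.5])  # percentile 95% interval
print(f"estimate = {estimate:.3f}, SE = {std_err:.3f}, "
      f"95% interval = ({lo:.3f}, {hi:.3f})")

The percentile interval used here is one simple way to turn the bootstrap distribution into a confidence interval; the bootstrap standard error alone is another common summary.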
In the bootstrap we sample our original data with replacement to create another
sample of the same size [20]. If there were N cases in our original dataset, the
resampled set would also have N cases, but approximately 37% of them would be