3.3 DECISION THRESHOLDS
In its final implementation we have to choose a decision threshold T for our CI, or
equivalently an operating point on its ROC curve: above what rating value should
the CI tell its users that a case is positive? Frequently CI developers choose the
threshold that maximizes the accuracy of the CI on some test set. For example,
such a developer would choose the threshold T (t = 4.5) in Fig. 7.9. This choice
maximizes the number of correct calls by the CI on our dataset, but it makes two
dubious assumptions: that the prevalence of abnormal cases in our test sample is
the same as in the population on which the CI will be deployed, and that all correct
and incorrect decisions have the same benefits and costs. If calling an abnormal
case positive (a true positive) has a greater benefit, or utility, than calling a
normal case negative (a true negative), then maximizing accuracy is the wrong choice. For
example, by studying the decisions of radiologists, Abbey et al. [14] estimated that
the benefit of a true positive decision in breast cancer screening is about 162 times
greater than the benefit of a true negative, and therefore the false positive fraction
that maximizes utility in screening is about 450 times larger than the one that would
maximize accuracy.
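The prevalence assumption is easy to see numerically. The sketch below uses hypothetical operating points and prevalences (not values from this chapter) to show how the accuracy-maximizing choice flips as prevalence changes:

```python
# Minimal sketch with hypothetical numbers: the accuracy of a fixed operating
# point depends on disease prevalence, so the accuracy-maximizing threshold
# chosen on a test set need not be optimal in deployment.

def accuracy(sens: float, spec: float, prevalence: float) -> float:
    """Overall accuracy of a binary classifier at one operating point."""
    return prevalence * sens + (1.0 - prevalence) * spec

# Two hypothetical operating points on the same ROC curve.
lenient = (0.95, 0.60)  # (sensitivity, specificity): low decision threshold
strict = (0.60, 0.95)   # high decision threshold

for prev in (0.50, 0.05):  # balanced test set vs. screening population
    acc_lenient = accuracy(*lenient, prev)
    acc_strict = accuracy(*strict, prev)
    print(f"prevalence={prev:.2f}: lenient={acc_lenient:.3f}, strict={acc_strict:.3f}")

# At prevalence 0.50 the two thresholds tie on accuracy; at 0.05 the strict
# threshold wins on accuracy even though it misses 40% of the abnormal cases.
```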
The threshold T should instead be chosen to maximize the expected utility
of all the decisions that the CI will make, where the utility is the sum of the benefits
from true results minus the costs of false ones [15,16]. Different decision thresh-
olds yield different numbers of true positives and true negatives, as in Fig. 7.9,
and therefore different total expected utilities. Although these expected utilities are
difficult to calculate with any degree of accuracy, experiments show that all of
us set decision thresholds in everyday practice as though we were attempting to
maximize the benefits and minimize the costs of our decisions.
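To make this recipe concrete, here is a hedged sketch of utility-based threshold selection. The binormal rating model, the prevalence, and the utility values are all illustrative assumptions, not values from this chapter; in a real application they would have to be estimated, for example along the lines of Abbey et al. [14]:

```python
# Sketch: choose a threshold by expected utility rather than accuracy.
# All utilities and the prevalence below are illustrative assumptions.

import numpy as np
from scipy.stats import norm

prevalence = 0.01        # assumed fraction of abnormal cases in deployment
u_tp, u_fn = 100.0, 0.0  # assumed benefit of a true positive, of a miss
u_tn, u_fp = 1.0, 0.0    # assumed benefit of a true negative, of a recall

# Assume a simple binormal rating model: normals ~ N(0,1), abnormals ~ N(1.5,1).
thresholds = np.linspace(-3, 5, 801)
tpf = norm.sf(thresholds, loc=1.5)  # sensitivity at each threshold
fpf = norm.sf(thresholds, loc=0.0)  # false positive fraction at each threshold

accuracy = prevalence * tpf + (1 - prevalence) * (1 - fpf)
utility = (prevalence * (u_tp * tpf + u_fn * (1 - tpf))
           + (1 - prevalence) * (u_tn * (1 - fpf) + u_fp * fpf))

print("max-accuracy threshold:", thresholds[np.argmax(accuracy)])
print("max-utility  threshold:", thresholds[np.argmax(utility)])
# With a large benefit assigned to true positives, the utility-maximizing
# threshold sits well below the accuracy-maximizing one, i.e. it tolerates
# many more false positives, as in the screening example above.
```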
As an example of decision thresholds and utilities in real-world practice, consider
the data of Elmore et al. [17]. The circles in Fig. 7.11 show the sensitivities
and specificities of 10 HI (human intelligence) readers when deciding whether 150 pa-
tients should be sent to biopsy. The diamonds show how the same 10
readers decided whether to send the same 150 patients for other diagnostic tests.
Note that all these decisions are consistent with a single ROC curve, which is
modeled by the wide solid line. All the readers processed the same image data,
and all had roughly the same AUC, that is, the same ability to separate
normal patients from abnormal patients. Nevertheless, different readers operated at
very different decision thresholds, and every reader attempted to maximize the utility
of their decisions by using a higher threshold for sending patients to biopsy, because
a biopsy costs more than the other diagnostic tests [18].
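The geometry of that example, many operating points on one underlying ROC curve, can be sketched with an assumed equal-variance binormal model; the separation d' and the two thresholds below are illustrative, not fitted to Fig. 7.11:

```python
# Sketch: two operating points on one ROC curve (illustrative binormal model).
import numpy as np
from scipy.stats import norm

d_prime = 1.5  # assumed separation of normal and abnormal rating distributions
auc = norm.cdf(d_prime / np.sqrt(2))  # AUC shared by every operating point

for label, t in [("biopsy (high threshold)", 2.0),
                 ("other diagnostics (lower threshold)", 0.5)]:
    sens = norm.sf(t, loc=d_prime)
    spec = norm.cdf(t)
    print(f"{label}: sensitivity={sens:.2f}, specificity={spec:.2f}, AUC={auc:.2f}")

# Both points lie on the same curve (same AUC, same ability to separate
# patients); only the decision threshold differs, mirroring the biopsy vs.
# further-workup recommendations in Fig. 7.11.
```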
4. VARIABILITY AND BIAS IN OUR PERFORMANCE ESTIMATES
It is important to provide a measure of uncertainty, such as a confidence
interval, for every performance measurement that we report. Our measurement is