




                         3.3 DECISION THRESHOLDS
In its final implementation, we have to choose a decision threshold T for our CI or, equivalently, an operating point on our ROC curve. Above what rating value should our CI tell its users that a case is positive? Frequently, CI developers choose a threshold that maximizes the accuracy of the CI on some test set. For example, such a developer would choose the threshold T (t = 4.5) in Fig. 7.9. This choice maximizes the number of correct calls by the CI on our dataset, but it makes two dubious
                         assumptions. It assumes that the prevalence of abnormal cases in our test sample is
                         the same as for the population on which the CI will be implemented, and it assumes
                         that all correct/incorrect decisions have the same benefits/costs. If calling an
                         abnormal case positive (true positive) has a greater benefit or utility than calling a
                         normal case negative, then maximizing accuracy is the wrong choice to make. For
                         example, by studying the decisions of radiologists, Abbey et al. [14] estimated that
                         the benefit of a true positive decision in breast cancer screening is about 162 times
                         greater than the benefit of a true negative, and therefore the false positive fraction
                         in screening is about 450 times larger than that which would maximize accuracy.
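To make both assumptions explicit, it helps to write accuracy in terms of the true-positive fraction TPF, the false-positive fraction FPF, and the prevalence p of abnormal cases in the sample (notation introduced here for illustration; it does not appear in the figure):

    Accuracy(T) = p × TPF(T) + (1 − p) × [1 − FPF(T)].

The threshold that maximizes this quantity therefore shifts whenever the prevalence p changes, and it credits a true positive exactly as much as a true negative; when those two outcomes have very different clinical value, as in the screening example above, the accuracy-maximizing threshold is not the one we want.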
                            The choice of threshold T should be the one that maximizes the expected utility
                         of all the decisions that a CI will make, where the utility is the sum of the benefits
                         from true results minus the costs of false ones [15,16]. Different decision thresh-
                         olds yield different numbers of true positives and true negatives, as in Fig. 7.9,
                         and therefore different total expected utilities. While these expected utilities are
                         difficult to calculate with any degree of accuracy, experiments show that all of
us set decision thresholds in everyday practice as though we were attempting to maximize the benefits and minimize the costs of our decisions.
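A minimal sketch of this utility-maximizing choice, assuming Gaussian rating distributions and illustrative utility and prevalence values (none of these numbers come from the chapter), might look like the following:

```python
import numpy as np

# Illustrative sketch: CI ratings for normal and abnormal cases are drawn from
# two Gaussians, and we pick the threshold that maximizes expected utility per case.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 1.0, 100_000)      # ratings of normal cases (assumed)
abnormal = rng.normal(1.5, 1.0, 100_000)    # ratings of abnormal cases (assumed)

prevalence = 0.01                            # assumed population prevalence
u_tp, u_tn, u_fp, u_fn = 100.0, 1.0, -2.0, -50.0   # assumed utilities of the four outcomes

best_t, best_eu = None, -np.inf
for t in np.linspace(-4, 6, 1001):
    tpf = np.mean(abnormal >= t)             # sensitivity at this threshold
    fpf = np.mean(normal >= t)               # 1 - specificity at this threshold
    eu = (prevalence * (tpf * u_tp + (1 - tpf) * u_fn)
          + (1 - prevalence) * (fpf * u_fp + (1 - fpf) * u_tn))
    if eu > best_eu:
        best_t, best_eu = t, eu

print(f"utility-maximizing threshold: {best_t:.2f}, expected utility per case: {best_eu:.3f}")
```

Raising the benefit of a true positive relative to a true negative, or lowering the assumed prevalence, moves the selected threshold and, with it, the operating point on the ROC curve. In the standard decision-theoretic formulation, the optimal operating point is where the slope of the ROC curve equals [(1 − p)/p] × [(U_TN − U_FP)/(U_TP − U_FN)].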
                            As an example of decision thresholds and utilities in real-world practice, we
                         give data from Elmore et al. [17]. The circles in Fig. 7.11 show the sensitivities
                         and specificities of 10 HI (human intelligence) readers when deciding if 150 pa-
                         tients should be sent to biopsy. The diamonds in the plot show how the same 10
                         readers decided whether to send the same 150 patients to different diagnostics.
                         Note that all these decisions are consistent with a single ROC curve, which is
                         modeled by the wide solid line. All the readers processed the same image data,
                         and all the readers had roughly the same AUC and the same ability to separate
normal patients from abnormal patients. However, these decisions were made at very different decision thresholds: the readers differed from one another, and every reader used a higher threshold for sending patients to biopsy than for the other diagnostics. In doing so, the readers were attempting to maximize the utility of their decisions, because the cost of a biopsy is higher than that of the other diagnostic procedures [18].
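To illustrate how a single ROC curve, and hence a single AUC, can accommodate very different operating points, the following sketch uses a binormal model with assumed parameters (not the Elmore et al. data) to compute the sensitivity and specificity implied by a lenient and a strict threshold:

```python
from scipy.stats import norm

# Binormal ROC: normal ratings ~ N(0, 1), abnormal ratings ~ N(mu, 1).
# The separation mu and the two thresholds are assumptions for illustration.
mu = 1.8
auc = norm.cdf(mu / 2**0.5)                 # AUC is the same regardless of threshold

for label, t in [("lenient threshold (e.g., further diagnostics)", 0.5),
                 ("strict threshold (e.g., biopsy)", 1.8)]:
    sensitivity = 1 - norm.cdf(t - mu)      # P(rating >= t | abnormal)
    specificity = norm.cdf(t)               # P(rating <  t | normal)
    print(f"{label}: sens = {sensitivity:.2f}, spec = {specificity:.2f}, AUC = {auc:.2f}")
```

The two operating points differ markedly in sensitivity and specificity even though they lie on the same curve, which is the pattern seen in Fig. 7.11.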


                         4. VARIABILITY AND BIAS IN OUR PERFORMANCE
                            ESTIMATES

It is important to report a measure of uncertainty, or to provide confidence intervals, for any performance measurement that we quote. Our measurement is