Page 163 - Artificial Intelligence in the Age of Neural Networks and Brain Computing

CHAPTER 7 Pitfalls and Opportunities in the Development of AI Systems

FIGURE 7.9
This figure gives the threshold-dependent performance measures TPF, TNF, PPV, NPV,
and accuracy at all thresholds for a small dataset.

true population in which it will be used. For example, if we test the CI described
above on data seeded with extra cases of a rare, serious disease, then the measured
accuracy, PPV, and NPV will be meaningless for the actual low-prevalence population.
Furthermore, as we show later, the optimal classifier decision threshold (T value)
usually does not correspond to the one yielding maximum accuracy.
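
The dependence on prevalence can be made concrete with a short Python sketch. It is illustrative only: the function name and the TPF, TNF, and prevalence values are assumptions chosen for the example, not figures taken from this chapter. It assumes that TPF and TNF stay fixed while only the fraction of diseased cases changes.

```python
def population_measures(tpf, tnf, prevalence):
    """Return (PPV, NPV, accuracy) for a test with fixed TPF and TNF
    applied to a population with the given disease prevalence."""
    tp = tpf * prevalence                # diseased, called positive
    fn = (1 - tpf) * prevalence          # diseased, called negative
    tn = tnf * (1 - prevalence)          # nondiseased, called negative
    fp = (1 - tnf) * (1 - prevalence)    # nondiseased, called positive
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = tp + tn
    return ppv, npv, accuracy

# Hypothetical operating point: TPF = 0.80, TNF = 0.90 (assumed values).
enriched = population_measures(0.80, 0.90, prevalence=0.50)  # seeded test set
rare = population_measures(0.80, 0.90, prevalence=0.01)      # intended-use population

print("enriched (PPV, NPV, accuracy):", [round(v, 3) for v in enriched])
print("rare     (PPV, NPV, accuracy):", [round(v, 3) for v in rare])
```

Under these assumed values the PPV measured on the enriched set is close to 0.89, while in the 1% prevalence population it falls below 0.08, even though TPF and TNF are unchanged.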
Consider the CI from Fig. 7.2 designed to discriminate between two classes of
patients, abnormal (shown upside down) and normal (shown right side up). In
Fig. 7.9 we order those patients using the ratings that were assigned by the CI.
Ideally every truly abnormal patient would have been given a rating higher than
every normal patient, and we could assign every abnormal patient as positive, and
every normal patient as negative. However, due to our imperfect CI, or perhaps
due to the noisy images themselves, the normal and abnormal patients are not
perfectly separable given the ratings.
Now what happens if we change our threshold T on the CI rating? Of course
we can calculate TPF, TNF, PPV, NPV, and accuracy for any decision threshold.
For example, in Fig. 7.9 if we use threshold T_5 and declare that all patients with
a CI rating greater than 4.5 tested positive, then 4/5 of the diseased patients
will be correctly declared positive (TPF = 80%), 5/6 of the nondiseased patients
will be correctly declared negative (TNF = 83%), 4/5 of the patients that we called
positive really have disease (PPV = 80%), 5/6 of the patients that we called negative
are truly normal (NPV = 83%), and 9/11 of the patients were correctly assigned
(accuracy = 82%).
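
The arithmetic in this example can be checked with a minimal Python sketch. The counts TP = 4, FN = 1, TN = 5, FP = 1 follow from the fractions quoted in the text. The individual ratings are not given in the text, so the `hypothetical_ratings` list below is an assumption, constructed only so that a cut at 4.5 reproduces those counts. The function names are likewise illustrative.

```python
def counts_at_threshold(ratings, is_abnormal, threshold):
    """Count (TP, FN, TN, FP) when ratings above `threshold` are called positive."""
    tp = fn = tn = fp = 0
    for rating, abnormal in zip(ratings, is_abnormal):
        positive = rating > threshold
        if abnormal and positive:
            tp += 1
        elif abnormal:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp

def threshold_metrics(tp, fn, tn, fp):
    """Threshold-dependent performance measures from the four counts."""
    return {
        "TPF": tp / (tp + fn),        # fraction of diseased called positive
        "TNF": tn / (tn + fp),        # fraction of nondiseased called negative
        "PPV": tp / (tp + fp),        # fraction of positive calls that are diseased
        "NPV": tn / (tn + fn),        # fraction of negative calls that are normal
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# HYPOTHETICAL ratings (not read from Fig. 7.9): 5 abnormal and 6 normal patients.
hypothetical_ratings = [1, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8]
is_abnormal = [False, False, False, False, True, False,
               False, True, True, True, True]

at_4_5 = threshold_metrics(*counts_at_threshold(hypothetical_ratings, is_abnormal, 4.5))
at_2_5 = threshold_metrics(*counts_at_threshold(hypothetical_ratings, is_abnormal, 2.5))

for name, value in at_4_5.items():
    print(f"{name}: {value:.0%}")
```

With the cut at 4.5 this prints the 80%, 83%, 80%, 83%, and 82% values quoted above. Moving the cut to 2.5 on the same assumed ratings gives a different set of counts, and the values in `at_2_5` shift accordingly.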
Note that all of these measures change when we change our decision threshold
for testing positive. For example, if we use threshold T_3 and declare that all patients
with a CI rating greater than 2.5 tested positive, all the above measures will be
different. While a decision threshold may be important for evaluating the utility
of a CI in a particular scenario, usually when comparing the ability of CIs to separate
two classes we prefer measures of performance that are not dependent upon a
