Page 160 - Artificial Intelligence in the Age of Neural Networks and Brain Computing
as our estimate of the performance on the population of a CI trained with 100% of
the data.
In theory this estimate should be slightly pessimistic because we are training the
CI with only 90% of the data rather than 100%. In practice, however, we have
regularly seen this method be substantially positively biased for the following reasons:
1. Users perform the entire above cross-validation repeatedly, tweaking the CI each
time until they find a CI architecture that works very well on their particular
data.
2. Users have data with large batch effects, such as the tank example in the previous
section.
3. Some steps of CI training, such as feature selection, were left outside the
cross-validation loop, as in Method 3 of Table 7.1.
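The leakage in point 3 can be made concrete with a small sketch. Everything here is hypothetical: the data are random, and the nearest-class-mean "CI" merely stands in for a real trained observer. The point is only where the feature selection happens.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 60 cases, 50 features, labels 0 ("normal") / 1 ("abnormal").
X = rng.normal(size=(60, 50))
y = rng.integers(0, 2, size=60)

def select_features(X, y, n_keep=5):
    """Keep the features most correlated with the label -- on the given data only."""
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(corr)[-n_keep:]

k = 10
idx = rng.permutation(len(y))
folds = np.array_split(idx, k)
scores = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # The leaky shortcut (Method 3 of Table 7.1) would call select_features on
    # ALL of X, y before this loop; doing it inside the loop, on the training
    # fold alone, keeps the held-out fold truly unseen.
    keep = select_features(X[train_idx], y[train_idx])
    # Stand-in "CI": classify by the nearer class mean on the selected features.
    mu0 = X[train_idx][:, keep][y[train_idx] == 0].mean(axis=0)
    mu1 = X[train_idx][:, keep][y[train_idx] == 1].mean(axis=0)
    Xt = X[test_idx][:, keep]
    pred = np.linalg.norm(Xt - mu1, axis=1) < np.linalg.norm(Xt - mu0, axis=1)
    scores.append(np.mean(pred == (y[test_idx] == 1)))
print(round(float(np.mean(scores)), 3))
```

Because the labels here are pure noise, the honest loop above averages near chance performance; moving `select_features` outside the loop would report a deceptively high score on the same meaningless data.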
Separate Testing Set: Before starting the training of the CI, we can set aside a
portion of the data on which we will later test our CI, as in Method 1 of Table 7.1.
Developers won’t have access to this testing set, and the application of the CI to the
testing set will only happen once before the results of the test are reported. When
selecting the test set, we should seek out data that is substantially different
from the training data. At a minimum, the two sets contain different cases. Ideally the cases
in the training and testing sets should have been collected on different days, in
different batches, with different cameras or collection devices. This truly tests the
ability of our CI to generalize to the entire population, and the performance on
the test set will provide an unbiased estimate of the performance on the whole
intended use population. Note that this method of evaluating performance may be
required by regulatory bodies.
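One way to realize such a split in code is to hold out whole collection batches rather than individual cases. The sketch below assumes each case carries a batch label (a day, a site, or a camera); the batch count and the choice of held-out batches are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical metadata: each of 200 cases was collected in one of 8 batches.
batch = rng.integers(0, 8, size=200)

# Hold out whole batches, not individual cases, so the test set differs from
# the training set in acquisition conditions as well as in cases.
test_batches = np.array([6, 7])
is_test = np.isin(batch, test_batches)
train_idx = np.where(~is_test)[0]
test_idx = np.where(is_test)[0]

# No batch appears on both sides of the split.
assert set(batch[train_idx]).isdisjoint(batch[test_idx])
print(len(train_idx), len(test_idx))
```

Splitting by batch, rather than by case, is what prevents the batch effects of the previous section (the tank example) from inflating the test-set performance estimate.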
3.2 PERFORMANCE MEASURES
There are many possible performance measures to choose from for CI observers.
Recall what our CI observer is doing. From the data for a particular case it computes
a rating t. We want to use this rating to help us discriminate between cases of class A
(called here “abnormal”) and class B (“normal”). A larger value of t should indicate a
higher probability of being abnormal. If t is greater than some threshold T, then we say
the case is “positive” by our CI, otherwise it is “negative.” Canned CI packages may
just return a binary yes/no classification. However, somewhere those algorithms are
making calculations on continuous or ordinal input features and later making an
implicit or explicit binary decision. Our rating t is the underlying value on which this
decision is based. During training, our observer establishes a mapping of t values in
feature space as we showed in Fig. 7.4. The contours of constant t illustrated there
correspond to possible thresholds T and delineate the corresponding decision surfaces.
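The relationship between the continuous rating t and the binary decision can be sketched as follows. The ratings are synthetic (two overlapping Gaussians standing in for the normal and abnormal classes) and the threshold value is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ratings t from a CI: larger t suggests "more abnormal".
t_normal = rng.normal(0.0, 1.0, size=100)    # class B ("normal")
t_abnormal = rng.normal(1.5, 1.0, size=100)  # class A ("abnormal")

T = 0.75  # one possible decision threshold

# The binary "positive"/"negative" call is derived from the underlying rating.
positive_normal = t_normal > T      # false positives among normal cases
positive_abnormal = t_abnormal > T  # true positives among abnormal cases

fpf = positive_normal.mean()    # fraction of normals called positive
tpf = positive_abnormal.mean()  # fraction of abnormals called positive
print(f"FPF={fpf:.2f}  TPF={tpf:.2f}")
```

Sweeping T over the range of t traces out the full trade-off between these two fractions; any single yes/no output of a canned package corresponds to one fixed choice of T on that curve.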
We assume that the cases we are using to test the CI were selected at random
from a larger population of cases. The output ratings of the CI are a random sample
from some probability distribution. The probability density of t, p(t), over our
dataset is shown in Fig. 7.7A. The area under the curve p(t) to the right of T is