as our estimate of the performance on the population of a CI trained with 100% of the data.

In theory this estimate should be slightly pessimistic, because we are training the CI with only 90% of the data rather than 100%. In practice, however, we have regularly seen this method be substantially positively biased, for the following reasons:

1. Users perform the entire cross-validation described above repeatedly, tweaking the CI each time, until they find a CI architecture that works very well on their particular data.
2. Users have data with large batch effects, such as the tank example in the previous section.
3. Some steps of CI training, such as feature selection, are not included in the cross-validation, as in Method 3 of Table 7.1 (this pitfall is illustrated in the sketch below).
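Reason 3 deserves a concrete illustration. The following is a minimal sketch, assuming scikit-learn and purely synthetic data whose labels carry no real signal, so an honest accuracy estimate should be near chance (0.5). Selecting features on the full dataset before cross-validating lets the selector peek at the test folds and inflates the estimate:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))   # pure noise features: no class signal
y = rng.integers(0, 2, size=100)   # random class labels

# Biased: pick the 10 "best" features using ALL the data, then cross-validate.
X_peeked = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(LogisticRegression(), X_peeked, y, cv=10).mean()

# Unbiased: redo feature selection inside each training fold via a pipeline.
pipe = make_pipeline(SelectKBest(f_classif, k=10), LogisticRegression())
honest = cross_val_score(pipe, X, y, cv=10).mean()

print(f"selection outside CV: {biased:.2f}")  # typically well above chance
print(f"selection inside CV:  {honest:.2f}")  # near 0.5, as it should be
```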
Separate Testing Set: Before starting the training of the CI, we can set aside a portion of the data on which we will later test our CI, as in Method 1 of Table 7.1. Developers won't have access to this testing set, and the CI will be applied to it only once, just before the results of the test are reported. When selecting the test set, we should make it as different from the training data as possible. At a minimum, the two sets contain different cases. Ideally, the cases in the training and testing sets should have been collected on different days, in different batches, and with different cameras or collection devices (a batch-based hold-out of this kind is sketched below). This truly tests the ability of our CI to generalize to the entire population, and the performance on the test set will provide an unbiased estimate of the performance on the whole intended-use population. Note that this method of evaluating performance may be required by regulatory bodies.
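Holding out whole collection batches, rather than randomly chosen cases, can be expressed with a group-aware splitter. A minimal sketch, assuming scikit-learn's GroupShuffleSplit and a made-up batch label (e.g., acquisition day) for each case:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # one feature row per case (synthetic)
y = rng.integers(0, 2, size=200)       # class labels
batch = rng.integers(0, 10, size=200)  # acquisition day/batch of each case

# Reserve about 20% of the batches (not 20% of the cases) for the final test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=batch))

# No batch contributes cases to both sets.
assert set(batch[train_idx]).isdisjoint(batch[test_idx])

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]  # lock away until the one final test
```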


3.2 PERFORMANCE MEASURES
There are many possible performance measures to choose from for CI observers. Recall what our CI observer is doing. From the data for a particular case it computes a rating t. We want to use this rating to help us discriminate between cases of class A (called here "abnormal") and class B ("normal"). A larger value of t should indicate a higher probability of being abnormal. If t is greater than some threshold T, then we say the case is "positive" according to our CI; otherwise it is "negative." Canned CI packages may return only a binary yes/no classification. However, somewhere those algorithms perform calculations on continuous or ordinal input features and later make an implicit or explicit binary decision; our rating t is the underlying value on which that decision is based. During training, our observer establishes a mapping of t values in feature space, as we showed in Fig. 7.4. The contours of constant t illustrated there correspond to possible thresholds T and delineate the corresponding decision surfaces.
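As a minimal sketch (plain NumPy, with illustrative ratings), the decision rule is simply a comparison of t against T, and each choice of T partitions the same ratings differently:

```python
import numpy as np

t = np.array([0.12, 0.35, 0.48, 0.61, 0.77, 0.93])  # one rating per case

def decide(t, T):
    """Call a case 'positive' when its rating exceeds the threshold T."""
    return t > T

print(decide(t, T=0.5))  # [False False False  True  True  True]
print(decide(t, T=0.7))  # stricter threshold: fewer cases called positive
```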
We assume that the cases we are using to test the CI were selected at random from a larger population of cases. The output ratings of the CI are a random sample from some probability distribution. The probability density of t, p(t), over our dataset is shown in Fig. 7.7A. The area under the curve p(t) to the right of T is