Table 7.1 The Measured AUC Performance of Four Different Methods of Developing Our CI [9]

                          Method 1       Method 2         Method 3       Method 4
  Features selected on    Training set   Known a priori   Entire set     Entire set
  CI trained on           Training set   Training set     Training set   Entire set
  CI tested on            Testing set    Testing set      Testing set    Entire set
  Tested AUC              0.52           0.63             0.83           0.91




We cheated in our feature selection, and we will pay for it. Actually, we may not pay for it at all. Our cheating will give us better (more publishable) results; we will be much sought after and awarded tenure at an early age, whereas our more honest colleagues will have worse results and end their days flipping burgers. No, no, forget that, and let's try to clean up our act. We cheated by using cases for feature selection that we later, very fastidiously, separated into sets for training and for testing. This is a common failing. Very frequently the entire dataset is used during the feature selection stage as though this were not an integral part of training [10]. Table 7.1 illustrates how this can lead to a very significant but false increase in perceived performance.
In Table 7.1 the true ideal AUC value is 0.70, due to 30 truly useful features out of 1000. Method 1, which is honest CI development, is overwhelmed by the 970 noisy features because the training set is limited. Knowing the truly useful features a priori (method 2) helps, but variability in the training and testing datasets still limits CI performance. Using the entire dataset (training and testing sets together) for either the feature selection or the training (methods 3 and 4) yields a significant positive bias and an undeserved paper in Nature.
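The bias behind methods 1 and 3 is easy to reproduce. The sketch below is our own illustration, not the code behind Table 7.1; the sample size, signal strength, feature counts, and choice of classifier are assumptions made only for the demonstration. It selects 30 of 1000 features either on the training set alone (honest) or on the entire dataset (leaky), then trains on the training set and measures AUC on the held-out test set.

```python
# Illustrative sketch of feature-selection leakage (assumed sizes and signal).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p, p_useful = 200, 1000, 30            # small sample, mostly noise features
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, size=n)
X[:, :p_useful] += 0.3 * y[:, None]       # weak signal in the first 30 features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

def auc_with_selection(sel_X, sel_y):
    """Select 30 features on (sel_X, sel_y), train on the training set, test on the test set."""
    sel = SelectKBest(f_classif, k=p_useful).fit(sel_X, sel_y)
    clf = LogisticRegression(max_iter=1000).fit(sel.transform(X_tr), y_tr)
    return roc_auc_score(y_te, clf.predict_proba(sel.transform(X_te))[:, 1])

print("method 1 (features selected on training set):", round(auc_with_selection(X_tr, y_tr), 2))
print("method 3 (features selected on entire set):  ", round(auc_with_selection(X, y), 2))
```

Because the leaky selector has already seen the test labels, the features it keeps are partly those that correlate with them by chance, and the measured AUC is optimistically inflated relative to the honest estimate; this is exactly the bias that Table 7.1 quantifies.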
Our architecture is wrong. It is too complicated for the task and for the data we have. We want perfection, but have yet to accept the fact that perfection is evil. We should want generalizability. It is always possible to achieve perfection on the training set: just follow Tom Cover [8], who demonstrated that we can increase the size of the feature space until any small child could pass a (hyper)plane through it, separating the two classes perfectly. Or, equivalently, we could encase each type A case in a sphere of vanishingly small radius, for example by letting a smoothness parameter go to zero as in Fig. 7.4D. Once again we would have perfect separation in the training set, with the added benefit of having produced a classifier with perfect specificity but, unfortunately, zero sensitivity for the entire universe of cases outside of the training set. Very complex algorithms with large feature spaces are prone to failure when presented with cases that lie outside their training set. For example, Su, Vargas, and Sakurai show that changing a single pixel in an image can completely change the decision of a complex CI such as a deep neural network [11].
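Cover's point can also be checked numerically. The following sketch is an illustration under assumed sample sizes, not Cover's construction: it fits a hyperplane (here by simple least squares on ±1 targets, one of many ways to find a separating plane) to pure-noise data with randomly assigned labels. Once the number of features exceeds the number of training cases, the training set is separated perfectly, while AUC on new cases remains at chance.

```python
# Illustrative sketch: perfect training-set separation in a large feature space,
# chance-level generalization (assumed sample sizes, pure-noise data).
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_train, n_test = 100, 1000
for p in (10, 200, 2000):                        # growing feature space
    X_tr = rng.normal(size=(n_train, p))         # noise features, no class structure
    y_tr = rng.integers(0, 2, size=n_train)      # labels assigned at random
    X_te = rng.normal(size=(n_test, p))
    y_te = rng.integers(0, 2, size=n_test)
    # Least-squares hyperplane on +/-1 targets; once p >= n_train the training
    # labels are interpolated exactly, i.e. the classes are perfectly separated.
    w, *_ = np.linalg.lstsq(X_tr, 2.0 * y_tr - 1.0, rcond=None)
    train_acc = np.mean((X_tr @ w > 0) == (y_tr == 1))
    test_auc = roc_auc_score(y_te, X_te @ w)
    print(f"p={p:5d}  training accuracy={train_acc:.2f}  test AUC={test_auc:.2f}")
```

With p well above the number of training cases the plane threads between the two classes exactly, yet its test AUC is indistinguishable from coin flipping, which is precisely the generalizability we should care about.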