Table 7.1 The Measured AUC Performance of Four Different Methods of Developing Our CI [9]

                       Method 1      Method 2      Method 3      Method 4
Features selected on   Training set  Known         Entire set    Entire set
CI trained on          Training set  Training set  Training set  Entire set
CI tested on           Testing set   Testing set   Testing set   Entire set
Tested AUC             0.52          0.63          0.83          0.91
We cheated in our feature selection, and we will pay for it. Actually, we may not
pay for it at all: our cheating will give us better (more publishable) results, and we
will be much sought after and awarded tenure at an early age, whereas our more honest
colleagues will have worse results and end their lives flipping burgers. No, no, forget
that, and let's try to clean up our act. We cheated by using cases for feature selection
that we later, very fastidiously, separated into sets for training and testing. This
is a common failing: very frequently the entire dataset is used during the feature
selection stage, as though feature selection were not an integral part of training [10].
Table 7.1 illustrates how this can lead to a very significant but false increase in
perceived performance.
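To make the leak concrete, here is a minimal sketch in Python with scikit-learn; it is our own illustration, not the experiment behind Table 7.1, and the dataset sizes and the SelectKBest/LogisticRegression choices are assumptions. On pure-noise data, selecting features before the train/test split produces an optimistic AUC, while splitting first keeps the estimate honest, near 0.5.

    # A deliberately leaky pipeline versus an honest one, on pure-noise data.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 1000))    # 1000 features, all pure noise
    y = rng.integers(0, 2, size=200)    # labels independent of every feature

    # Leaky protocol: the feature selector sees the cases we will later test on.
    X_sel = SelectKBest(f_classif, k=30).fit_transform(X, y)
    Xtr, Xte, ytr, yte = train_test_split(X_sel, y, test_size=0.5, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    print("leaky AUC:", roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]))  # well above 0.5

    # Honest protocol: split first, then select features on the training set only.
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
    sel = SelectKBest(f_classif, k=30).fit(Xtr, ytr)
    clf = LogisticRegression(max_iter=1000).fit(sel.transform(Xtr), ytr)
    print("honest AUC:", roc_auc_score(yte, clf.predict_proba(sel.transform(Xte))[:, 1]))  # near 0.5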
In Table 7.1 the true ideal AUC value is 0.70, because only 30 of the 1000 features
are truly useful. Method 1, which is honest CI development, is overwhelmed by the 970
noisy features because the training set is limited. Knowing the truly useful features a
priori (method 2) is helpful, but variability in the training and testing datasets still
limits CI performance. Using the entire dataset (training and testing sets together) for
either the feature selection or the training (methods 3 and 4) yields a significant
positive bias and an undeserved paper in Nature.
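A rough simulation in the same spirit can reproduce the gap between methods 1 and 2. Everything below (the 0.3 standard-deviation signal in the first 30 features, the sample sizes, the classifier) is our own construction, not the experiment of [9], but it shows how a small training set lets noise features crowd out the real ones during selection:

    # 30 weakly informative features hidden among 970 noise features.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)

    def draw(n):
        y = rng.integers(0, 2, size=n)
        X = rng.normal(size=(n, 1000))
        X[:, :30] += 0.3 * y[:, None]   # first 30 features carry a weak signal
        return X, y

    X_tr, y_tr = draw(100)
    X_te, y_te = draw(100)

    # Method 1: select the top 30 features using the training set alone.
    sel = SelectKBest(f_classif, k=30).fit(X_tr, y_tr)
    m1 = LogisticRegression(max_iter=1000).fit(sel.transform(X_tr), y_tr)
    print("method 1 AUC:", roc_auc_score(y_te, m1.predict_proba(sel.transform(X_te))[:, 1]))

    # Method 2: the 30 truly useful features are known a priori.
    m2 = LogisticRegression(max_iter=1000).fit(X_tr[:, :30], y_tr)
    print("method 2 AUC:", roc_auc_score(y_te, m2.predict_proba(X_te[:, :30])[:, 1]))

With these arbitrary effect sizes the selector in method 1 picks mostly noise, and its tested AUC falls well below that of method 2.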
Our architecture is wrong. It is too complicated for the task and for the data we
have. We want perfection, but have yet to accept the fact that perfection is evil; we
should want generalizability. It is always possible to achieve perfection on the
training set: just follow Tom Cover [8], who demonstrated that we can increase
the size of the feature space until any small child could pass a (hyper)plane
through it, separating the two classes perfectly. Or, equivalently, we could encase
each type A case in a sphere of vanishingly small radius, for example by letting
a smoothness parameter go to zero as in Fig. 7.4D. Once again we would have perfect
separation in the training set, with the added benefit of having produced a classifier
with perfect specificity but, unfortunately, zero sensitivity for the entire
universe of cases outside the training set. Very complex algorithms with large
feature spaces are prone to failure when presented with cases that lie outside
their training set. For example, Su, Vargas, and Sakurai show that changing a single
pixel in an image can completely change the discrimination of a complex CI such as a
deep neural network [11].
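The vanishing-sphere construction is easy to reproduce with an RBF-kernel support vector machine, where the kernel width stands in for the smoothness parameter; the sketch below, with parameter values of our own choosing, memorizes random labels perfectly on the training set and performs at chance outside it:

    # Random labels in two dimensions: there is no structure to learn.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(2)
    X_tr, y_tr = rng.normal(size=(100, 2)), rng.integers(0, 2, size=100)
    X_te, y_te = rng.normal(size=(100, 2)), rng.integers(0, 2, size=100)

    # An enormous gamma shrinks each Gaussian kernel bump to a tiny sphere
    # around its training case (radius ~ 1/sqrt(gamma)), i.e., the smoothness
    # parameter driven toward zero as in Fig. 7.4D.
    clf = SVC(kernel="rbf", gamma=1e4, C=1e6).fit(X_tr, y_tr)
    print("training accuracy:", clf.score(X_tr, y_tr))  # 1.0: perfect separation
    print("testing accuracy:", clf.score(X_te, y_te))   # ~0.5: chance

    # Away from every training case the decision function collapses to its bias
    # term, so nearly all new cases receive the same default label: the perfect
    # specificity, zero sensitivity behavior described above.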