37 How to decide whether to use all your data
Suppose your cat detector’s training set includes 10,000 images uploaded by users of your mobile app. This data
comes from the same distribution as a separate dev/test set, and represents the distribution
you care about doing well on. You also have an additional 20,000 images downloaded from
the internet. Should you provide all 20,000 + 10,000 = 30,000 images to your learning
algorithm as its training set, or discard the 20,000 internet images for fear that they will bias
your learning algorithm?
When using earlier generations of learning algorithms (such as hand-designed computer
vision features, followed by a simple linear classifier) there was a real risk that merging both
types of data would cause you to perform worse. Thus, some engineers will warn you against
including the 20,000 internet images.
But in the modern era of powerful, flexible learning algorithms, such as large neural
networks, this risk has greatly diminished. If you can afford to build a neural network with a
large enough number of hidden units/layers, you can safely add the 20,000 images to your
training set. Adding the images is more likely to improve your performance than to hurt it.
This observation relies on the fact that there is some x → y mapping that works well for
both types of data. In other words, there exists some system that inputs either an internet
image or a mobile app image and reliably predicts the label, even without knowing the
source of the image.
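The merge described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed pipeline: the `Example` record, the file paths, the alternating placeholder labels, and the helper names `build_training_set` and `model_inputs` are all assumptions made for the example. The point it shows is that both sources go into one training set, and the classifier is given only x (the image), never the source tag.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Example:
    path: str     # hypothetical path to the image file (the input x)
    is_cat: bool  # the label y (placeholder values below)
    source: str   # "user" or "internet"; bookkeeping only, not a feature


def build_training_set(user_train: List[Example],
                       internet: List[Example]) -> List[Example]:
    """Merge both sources into a single training set."""
    return list(user_train) + list(internet)


def model_inputs(examples: List[Example]) -> List[str]:
    """Return only x for each example; the source tag is deliberately dropped."""
    return [ex.path for ex in examples]


# Placeholder data with the sizes used in the text (labels are dummy values).
user_train = [Example(f"user/{i}.jpg", i % 2 == 0, "user") for i in range(10_000)]
internet = [Example(f"web/{i}.jpg", i % 2 == 0, "internet") for i in range(20_000)]

train = build_training_set(user_train, internet)
print(len(train))  # 30000
```

The dev and test sets are not touched here: they stay separate and continue to come only from the distribution you care about.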
Adding the additional 20,000 images has the following effects:
1. It gives your neural network more examples of what cats do/do not look like. This is
helpful, since internet images and user-uploaded mobile app images do share some
similarities. Your neural network can apply some of the knowledge acquired from internet
images to mobile app images.
2. It forces the neural network to expend some of its capacity on learning properties that
are specific to internet images (such as higher resolution, different distributions of how
the images are framed, etc.). If these properties differ greatly from mobile app images, it
will “use up” some of the representational capacity of the neural network. Thus there is
less capacity for recognizing data drawn from the distribution of mobile app images,
which is what you really care about. Theoretically, this could hurt your algorithm’s
performance.
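Because these two effects pull in opposite directions, one way to see which dominates on your own problem is to train with and without the internet images and compare the error on the dev set. The helper below is only a sketch under assumptions: `compare_training_sets`, `train_fn`, and `error_fn` are hypothetical names, and you would supply your own training and evaluation routines.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Ex = TypeVar("Ex")        # one labeled example
Model = TypeVar("Model")  # whatever your training routine returns


def compare_training_sets(
    train_fn: Callable[[Sequence[Ex]], Model],
    error_fn: Callable[[Model, Sequence[Ex]], float],
    user_train: Sequence[Ex],
    internet: Sequence[Ex],
    dev: Sequence[Ex],
) -> Tuple[float, float]:
    """Return (dev error with user data only, dev error with both sources)."""
    model_user_only = train_fn(list(user_train))
    model_combined = train_fn(list(user_train) + list(internet))
    # Both models are scored on the same dev set, which comes from the
    # distribution you care about (mobile app images).
    return error_fn(model_user_only, dev), error_fn(model_combined, dev)
```

If the second number is lower, the extra examples helped more than the lost capacity hurt; if it is higher, the capacity cost dominated for that model size.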
Page 73 Machine Learning Yearning-Draft Andrew Ng