
37 How to decide whether to use all your data



             Suppose your cat detector’s training set includes 10,000 user-uploaded images. This data
             comes from the same distribution as a separate dev/test set, and represents the distribution
             you care about doing well on. You also have an additional 20,000 images downloaded from
             the internet. Should you provide all 20,000+10,000=30,000 images to your learning
             algorithm as its training set, or discard the 20,000 internet images for fear that they will
             bias your learning algorithm?


             When using earlier generations of learning algorithms (such as hand-designed computer
             vision features, followed by a simple linear classifier) there was a real risk that merging both
             types of data would cause you to perform worse. Thus, some engineers will warn you against
             including the 20,000 internet images.


             But in the modern era of powerful, flexible learning algorithms, such as large neural
             networks, this risk has greatly diminished. If you can afford to build a neural network with a
             large enough number of hidden units/layers, you can safely add the 20,000 images to your
             training set. Adding the images is more likely to improve your performance than to hurt it.
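
             As a minimal sketch of the bookkeeping involved, the snippet below assembles the two
             candidate training sets from the example: 10,000 user-uploaded images alone, or all
             30,000 images. The function name build_training_set, the file names, and the "source"
             tag are hypothetical and are not part of the original text. The tag is kept only for
             later analysis and is never given to the model as an input. The dev/test sets are not
             touched here, since they should continue to come only from the user-uploaded
             distribution.

```python
from typing import List, Tuple

# An example is an (image_id, label, source) triple. The `source` tag is
# bookkeeping only (an assumption of this sketch); it is not a model input.
Example = Tuple[str, int, str]


def build_training_set(
    user_examples: List[Tuple[str, int]],
    internet_examples: List[Tuple[str, int]],
    include_internet: bool = True,
) -> List[Example]:
    """Tag each example with its source and optionally add the internet data."""
    train = [(img, label, "user") for img, label in user_examples]
    if include_internet:
        train += [(img, label, "internet") for img, label in internet_examples]
    return train


# Placeholder data with the sizes used in the text (file names are made up).
user = [(f"user_{i}.jpg", i % 2) for i in range(10_000)]
internet = [(f"web_{i}.jpg", i % 2) for i in range(20_000)]

train_all = build_training_set(user, internet, include_internet=True)
train_user_only = build_training_set(user, internet, include_internet=False)
print(len(train_all), len(train_user_only))  # 30000 10000
```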

             This observation relies on the fact that there is some x → y mapping that works well for
             both types of data. In other words, there exists some system that inputs either an internet
             image or a mobile app image and reliably predicts the label, even without knowing the
             source of the image.

             Adding the additional 20,000 images has the following effects:

             1. It gives your neural network more examples of what cats do/do not look like. This is
                helpful, since internet images and user-uploaded mobile app images do share some
                similarities. Your neural network can apply some of the knowledge acquired from internet
                images to mobile app images.

             2. It forces the neural network to expend some of its capacity to learn about properties that
                are specific to internet images (such as higher resolution, different distributions of how
                the images are framed, etc.). If these properties differ greatly from those of mobile app
                images, learning them will “use up” some of the representational capacity of the neural
                network. Thus there is less capacity for recognizing data drawn from the distribution of
                mobile app images, which is what you really care about. Theoretically, this could hurt
                your algorithm’s performance.
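
             Because these two effects pull in opposite directions, one way to see which dominates is
             to train once per candidate training set and compare the error on the dev set, which
             comes from the distribution you care about. The sketch below is only an illustration
             under stated assumptions: compare_training_options, train_fn, and error_fn are
             hypothetical names, and the majority-class "model" is a toy stand-in that exists only to
             make the code runnable. Its numbers say nothing about how a real neural network would
             behave.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

Example = Tuple[str, int]  # (image_id, label); toy placeholder format


def compare_training_options(
    options: Dict[str, Sequence[Example]],
    dev_set: Sequence[Example],
    train_fn: Callable[[Sequence[Example]], object],
    error_fn: Callable[[object, Sequence[Example]], float],
) -> Dict[str, float]:
    """Train one model per candidate training set and return its dev error."""
    results = {}
    for name, train_set in options.items():
        model = train_fn(train_set)
        results[name] = error_fn(model, dev_set)
    return results


# Toy stand-ins (assumptions of this sketch, not a real learning algorithm):
# the "model" is just the most common label seen in training.
def train_majority(train_set: Sequence[Example]) -> int:
    return Counter(label for _, label in train_set).most_common(1)[0][0]


def dev_error(model: int, dev_set: Sequence[Example]) -> float:
    return sum(label != model for _, label in dev_set) / len(dev_set)


user_train = [("u", 1)] * 6 + [("u", 0)] * 4   # target distribution
internet = [("w", 0)] * 20                     # different distribution
dev = [("u", 1)] * 7 + [("u", 0)] * 3          # target distribution only

results = compare_training_options(
    {"user_only": user_train, "all_data": user_train + internet},
    dev, train_majority, dev_error,
)
best = min(results, key=results.get)  # option with the lowest dev error
print(results, best)
```

             In practice you would pass your own training and evaluation routines as train_fn and
             error_fn. The key design choice is that both options are scored on the same dev set
             drawn from the mobile app distribution, never on the internet images.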






             Page 73                            Machine Learning Yearning-Draft                       Andrew Ng