
36 When you should train and test on different distributions


Users of your cat pictures app have uploaded 10,000 images, which you have manually labeled as containing cats or not. You also have a larger set of 200,000 images that you downloaded off the internet. How should you define train/dev/test sets?

Since the 10,000 user images closely reflect the actual probability distribution of data you want to do well on, you might use that for your dev and test sets. If you are training a data-hungry deep learning algorithm, you might give it the additional 200,000 internet images for training. Thus, your training and dev/test sets come from different probability distributions. How does this affect your work?
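
To make this concrete, here is a minimal sketch in Python of one way to build such a split. The names user_images and internet_images, the assumption that each is a list of (image, label) pairs, and the 5,000/5,000 dev/test sizes are all illustrative choices, not something prescribed by the text.

    import random

    def make_splits(user_images, internet_images, seed=0):
        rng = random.Random(seed)
        users = list(user_images)
        rng.shuffle(users)

        # Dev and test sets come only from the 10,000 user images,
        # since those reflect the distribution you want to do well on.
        dev_set = users[:5000]
        test_set = users[5000:]

        # All 200,000 internet images feed the data-hungry training set.
        train_set = list(internet_images)
        return train_set, dev_set, test_set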

Instead of partitioning our data into train/dev/test sets, we could take all 210,000 images we have, and randomly shuffle them into train/dev/test sets. In this case, all the data comes from the same distribution. But I recommend against this method, because about 200,000/210,000 ≈ 95.2% of your dev/test data would come from internet images, which does not reflect the actual distribution you want to do well on. Remember our recommendation on choosing dev/test sets:

    Choose dev and test sets to reflect data you expect to get in the future and want to do well on.
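
As a quick sanity check on the arithmetic above, the expected share of internet images in a randomly shuffled dev/test set is simply their share of the combined pool. A short computation in Python:

    n_user = 10_000
    n_internet = 200_000

    # Under a uniform shuffle, each dev/test example is an internet image
    # with probability equal to the internet images' share of the pool.
    internet_share = n_internet / (n_user + n_internet)
    print(f"{internet_share:.1%}")  # prints 95.2%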

Most of the academic literature on machine learning assumes that the training set, dev set and test set all come from the same distribution.[11] In the early days of machine learning, data was scarce. We usually only had one dataset drawn from some probability distribution. So we would randomly split that data into train/dev/test sets, and the assumption that all the data was coming from the same source was usually satisfied.

[11] There is some academic research on training and testing on different distributions. Examples include “domain adaptation,” “transfer learning” and “multitask learning.” But there is still a huge gap between theory and practice. If you train on dataset A and test on some very different type of data B, luck could have a huge effect on how well your algorithm performs. (Here, “luck” includes the researcher’s hand-designed features for the particular task, as well as other factors that we just don’t understand yet.) This makes the academic study of training and testing on different distributions difficult to carry out in a systematic way.

