Page 77 -
P. 77

40 Generalizing from the training set to the

             dev set



             Suppose you are applying ML in a setting where the training and the dev/test distributions
             are different. Say, the training set contains Internet images + Mobile images, and the
             dev/test sets contain only Mobile images. However, the algorithm is not working well: It has
             a much higher dev/test set error than you would like. Here are some possibilities of what

             might be wrong:

             1. It does not do well on the training set. This is the problem of high (avoidable) bias on the
                training set distribution.

             2. It does well on the training set, but does not generalize well to previously unseen data
                drawn from the same distribution as the training set​. This is high variance.


             3. It generalizes well to new data drawn from the same distribution as the training set, but
                not to data drawn from the dev/test set distribution. We call this problem ​data
                mismatch​, since it is because the training set data is a poor match for the dev/test set
                data.

             For example, suppose that humans achieve near perfect performance on the cat recognition

             task. Your algorithm achieves this:

             • 1% error on the training set

             • 1.5% error on data drawn from the same distribution as the training set that the algorithm
               has not seen


             • 10% error on the dev set

             In this case, you clearly have a data mismatch problem. To address this, you might try to
             make the training data more similar to the dev/test data. We discuss some techniques for
             this later.

             In order to diagnose to what extent an algorithm suffers from each of the problems 1-3

             above, it will be useful to have another dataset. Specifically, rather than giving the algorithm
             all the available training data, you can split it into two subsets: The actual training set which
             the algorithm will train on, and a separate set, which we will call the “Training dev” set, that
             we will not train on.

             You now have four subsets of data:



             Page 77                            Machine Learning Yearning-Draft                       Andrew Ng
   72   73   74   75   76   77   78   79   80   81   82