40 Generalizing from the training set to the dev set
Suppose you are applying ML in a setting where the training and the dev/test distributions
are different. Say, the training set contains Internet images + Mobile images, and the
dev/test sets contain only Mobile images. However, the algorithm is not working well: It has
a much higher dev/test set error than you would like. Here are some possibilities for what
might be wrong:
1. It does not do well on the training set. This is the problem of high (avoidable) bias on the
training set distribution.
2. It does well on the training set, but does not generalize well to previously unseen data
drawn from the same distribution as the training set. This is high variance.
3. It generalizes well to new data drawn from the same distribution as the training set, but
not to data drawn from the dev/test set distribution. We call this problem data
mismatch, since the training set data is a poor match for the dev/test set data.
For example, suppose that humans achieve near perfect performance on the cat recognition
task. Your algorithm achieves this:
• 1% error on the training set
• 1.5% error on data drawn from the same distribution as the training set that the algorithm
has not seen
• 10% error on the dev set
In this case, you clearly have a data mismatch problem. To address this, you might try to
make the training data more similar to the dev/test data. We discuss some techniques for
this later.
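To make this diagnosis concrete, here is a minimal Python sketch of the arithmetic behind it (the variable names are illustrative assumptions; the text prescribes no code):

    # Error rates from the example above, expressed as fractions.
    human_level_error = 0.00         # humans achieve near-perfect performance
    training_error = 0.01            # 1% error on the training set
    unseen_train_dist_error = 0.015  # 1.5% on unseen data from the training distribution
    dev_error = 0.10                 # 10% error on the dev set

    avoidable_bias = training_error - human_level_error    # problem 1
    variance = unseen_train_dist_error - training_error    # problem 2
    data_mismatch = dev_error - unseen_train_dist_error    # problem 3

    print(f"Avoidable bias: {avoidable_bias:.1%}")  # 1.0%
    print(f"Variance:       {variance:.1%}")        # 0.5%
    print(f"Data mismatch:  {data_mismatch:.1%}")   # 8.5%

The 8.5% gap between the dev error and the error on unseen same-distribution data dwarfs the other two gaps, which is why data mismatch is the clear diagnosis here.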
In order to diagnose to what extent an algorithm suffers from each of the problems 1-3
above, it will be useful to have another dataset. Specifically, rather than giving the algorithm
all the available training data, you can split it into two subsets: The actual training set which
the algorithm will train on, and a separate set, which we will call the “Training dev” set, that
we will not train on.
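As a minimal sketch of that split (the 80/20 fraction, function name, and variable names are illustrative assumptions, not from the text):

    import random

    def split_train_and_train_dev(examples, train_dev_fraction=0.2, seed=0):
        """Carve a held-out 'training dev' set out of the available training
        data. Both halves come from the same distribution, but the algorithm
        trains only on the first."""
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        n_train_dev = int(len(examples) * train_dev_fraction)
        return examples[n_train_dev:], examples[:n_train_dev]

    # Usage (hypothetical variable): all_training_data holds the
    # Internet images + Mobile images.
    # training_set, training_dev_set = split_train_and_train_dev(all_training_data)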
You now have four subsets of data: the training set, the training dev set, the dev set, and the test set.