6 Your dev and test sets should come from the same distribution
Suppose your cat app image data is segmented into four regions, based on your largest
markets: (i) US, (ii) China, (iii) India, and (iv) Other. To come up with a dev set and a test
set, say we put US and India in the dev set, and China and Other in the test set. In other
words, we can randomly assign two of these segments to the dev set, and the other two to
the test set, right? No: the two sets would then come from different distributions.
Once you define the dev and test sets, your team will focus on improving dev set
performance. The dev set should therefore reflect the task you most want to improve on:
doing well on all four geographies, not only two.
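One simple way to get dev and test sets from the same distribution is to pool the images
from all four regions, shuffle them, and only then split. The sketch below is a minimal
illustration, not a prescribed method: the function name `split_dev_test`, the
`(image_id, region)` pairs, the 50/50 split, and the synthetic data are all assumptions
made for this example.

```python
import random
from collections import Counter


def split_dev_test(examples, dev_fraction=0.5, seed=0):
    """Shuffle the pooled examples, then cut them into a dev set and a test set."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(examples)      # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical pooled data: 1,000 (image_id, region) pairs per region.
regions = ["US", "China", "India", "Other"]
examples = [(f"img_{region}_{i}", region) for region in regions for i in range(1000)]

dev, test = split_dev_test(examples)

# Count how many examples each region contributes to each set.
dev_counts = Counter(region for _, region in dev)
test_counts = Counter(region for _, region in test)
print("dev: ", dict(dev_counts))
print("test:", dict(test_counts))
```

Because every example is assigned at random, both sets end up with a similar mix of all
four regions, rather than two regions each.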
There is a second problem with having different dev and test set distributions: There is a
chance that your team will build something that works well on the dev set, only to find that it
does poorly on the test set. I’ve seen this result in much frustration and wasted effort. Avoid
letting this happen to you.
As an example, suppose your team develops a system that works well on the dev set but not
the test set. If your dev and test sets had come from the same distribution, then you would
have a very clear diagnosis of what went wrong: You have overfit the dev set. The obvious
cure is to get more dev set data.
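When both sets come from the same distribution, this diagnosis comes down to comparing
two numbers. As a rough sketch, assuming you already have predictions and labels for
each set (the helper name `error_rate` and the toy data here are hypothetical), you
might compute the gap like this:

```python
def error_rate(predictions, labels):
    """Fraction of examples where the prediction differs from the label."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return wrong / len(labels)


# Toy, made-up labels and predictions (1 = cat, 0 = not cat).
dev_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
dev_preds = [1, 0, 1, 0, 1, 0, 1, 0, 1, 1]    # 1 mistake out of 10
test_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
test_preds = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]   # 3 mistakes out of 10

dev_error = error_rate(dev_preds, dev_labels)
test_error = error_rate(test_preds, test_labels)
gap = test_error - dev_error   # positive when the test set is doing worse
print(f"dev error {dev_error:.2f}, test error {test_error:.2f}, gap {gap:.2f}")
```

A test error well above the dev error points to overfitting the dev set. How large a gap
counts as worrying depends on your set sizes, and this sketch does not try to decide that.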
But if the dev and test sets come from different distributions, then your options are less
clear. Several things could have gone wrong:
1. You have overfit the dev set.
2. The test set is harder than the dev set. So your algorithm might be doing as well as could
be expected, and no further significant improvement is possible.
Page 17 Machine Learning Yearning-Draft Andrew Ng