7 How large do the dev/test sets need to be?
The dev set should be large enough to detect differences between the algorithms that you are
trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an
accuracy of 90.1%, then a dev set of 100 examples would not be able to detect this 0.1%
difference, since each example accounts for a full 1% of accuracy. Compared to other machine
learning problems I’ve seen, a 100 example dev set is small. Dev sets with sizes from 1,000 to
10,000 examples are common. With 10,000 examples, you will have a good chance of detecting
an improvement of 0.1%. [2]
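To make the arithmetic concrete, here is a minimal Python sketch. It checks only granularity: on a dev set of n examples, accuracy can move only in steps of 1/n, so a gap smaller than one step cannot show up at all. The function names are hypothetical, and passing this check is necessary rather than sufficient, since sampling noise also matters.

```python
def accuracy_step(dev_size: int) -> float:
    """Accuracy change caused by one dev example flipping between right and wrong."""
    return 1 / dev_size


def can_resolve(dev_size: int, accuracy_gap: float) -> bool:
    """True if the gap is at least one example's worth of accuracy (granularity only)."""
    return accuracy_gap >= accuracy_step(dev_size)


if __name__ == "__main__":
    # 0.001 is the 0.1% gap between classifiers A and B from the text.
    for dev_size in (100, 1_000, 10_000):
        print(
            f"{dev_size:>6} examples: one example = {accuracy_step(dev_size):.2%}, "
            f"0.1% gap resolvable: {can_resolve(dev_size, 0.001)}"
        )
```

With 100 examples, a single example is worth 1.00% of accuracy, so a 0.1% gap is invisible. With 10,000 examples, a single example is worth 0.01%, and a 0.1% gap corresponds to 10 examples.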
For mature and important applications—for example, advertising, web search, and product
recommendations—I have also seen teams that are highly motivated to eke out even a 0.01%
improvement, since it has a direct impact on the company’s profits. In this case, the dev set
could be much larger than 10,000, in order to detect even smaller improvements.
How about the size of the test set? It should be large enough to give high confidence in the
overall performance of your system. One popular heuristic has been to use 30% of your data
for your test set. This works well when you have a modest number of examples, say 100 to
10,000 examples. But in the era of big data, where some machine learning problems now
have more than a billion examples, the fraction of data allocated to dev/test sets
has been shrinking, even as the absolute number of examples in the dev/test sets has been
growing. There is no need to have excessively large dev/test sets beyond what is needed to
evaluate the performance of your algorithms.
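The shrinking fraction is easy to see with a little arithmetic. The sketch below is only an illustration: apart from the 30% heuristic on 10,000 examples, the dataset and test set sizes are hypothetical numbers I picked, not recommendations.

```python
def eval_fraction(total_examples: int, eval_examples: int) -> float:
    """Fraction of all available data that is held out for evaluation."""
    return eval_examples / total_examples


# (total examples, test examples). The first row is the 30% heuristic;
# the other two rows are hypothetical big-data scenarios.
SCENARIOS = [
    (10_000, 3_000),
    (1_000_000, 10_000),
    (1_000_000_000, 1_000_000),
]

if __name__ == "__main__":
    for total, held_out in SCENARIOS:
        print(
            f"{total:>13,} total, {held_out:>9,} test "
            f"-> {eval_fraction(total, held_out):.2%} of the data"
        )
```

Going down the rows, the test set grows from 3,000 to 1,000,000 examples while its share of the data falls from 30% to 0.1%.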
[2] In theory, one could also test if a change to an algorithm makes a statistically significant difference
on the dev set. In practice, most teams don’t bother with this (unless they are publishing academic
research papers), and I usually do not find statistical significance tests useful for measuring interim
progress.
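For readers who do want such a test, here is one possible sketch: an exact McNemar test, which is my choice of technique rather than something the text prescribes. It assumes both classifiers were scored on the same dev examples and looks only at the examples where exactly one of them is correct. The function name and the counts in the usage example are hypothetical.

```python
from math import comb


def exact_mcnemar_p_value(only_a_correct: int, only_b_correct: int) -> float:
    """Two-sided exact McNemar p-value for two classifiers on the same dev set.

    only_a_correct: examples that A gets right and B gets wrong.
    only_b_correct: examples that B gets right and A gets wrong.
    Under the null hypothesis, each disagreement favors A or B with probability 0.5.
    """
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0  # the classifiers never disagree, so there is no evidence either way
    k = min(only_a_correct, only_b_correct)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical counts: a net gain of 10 examples for B, which is 0.1% of a 10,000 example dev set.
    print(exact_mcnemar_p_value(only_a_correct=40, only_b_correct=50))
```

A small p-value suggests the difference is unlikely to be due to chance alone. The same net gain of 10 examples can be more or less convincing depending on how many examples the two classifiers disagree on.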
Page 19 Machine Learning Yearning-Draft Andrew Ng