
7 How large do the dev/test sets need to be?




The dev set should be large enough to detect differences between algorithms that you are
trying out. For example, if classifier A has an accuracy of 90.0% and classifier B has an
accuracy of 90.1%, then a dev set of 100 examples would not be able to detect this 0.1%
difference. Compared to other machine learning problems I’ve seen, a 100-example dev set is
small. Dev sets with sizes from 1,000 to 10,000 examples are common. With 10,000
examples, you will have a good chance of detecting an improvement of 0.1%.²
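To make the arithmetic behind this concrete, here is a small back-of-the-envelope sketch of my
own (not from the text). It assumes each dev example is classified correctly independently with
probability 0.90, a simplifying binomial model, and reports both the smallest accuracy step a dev
set of n examples can even represent and the approximate standard error of the measured accuracy:

    import math

    def dev_set_resolution(n, base_accuracy=0.90):
        # Smallest possible change in measured accuracy: one example
        # flipping from wrong to right moves accuracy by 1/n.
        granularity = 1.0 / n
        # Approximate standard error of the accuracy estimate under a
        # simple binomial assumption (an illustrative assumption, not
        # something the text specifies).
        std_error = math.sqrt(base_accuracy * (1 - base_accuracy) / n)
        return granularity, std_error

    for n in (100, 1_000, 10_000):
        step, se = dev_set_resolution(n)
        print(f"n={n:>6,}: smallest step = {step:.2%}, std error ~ {se:.2%}")

    # n=   100: smallest step = 1.00%, std error ~ 3.00%
    # n= 1,000: smallest step = 0.10%, std error ~ 0.95%
    # n=10,000: smallest step = 0.01%, std error ~ 0.30%

On 100 examples a 0.1% difference cannot even show up (accuracy moves in 1% steps) and the
measurement noise is around 3%; on 10,000 examples the noise is on the same scale as a 0.1%
gap, and since both classifiers are scored on the same examples, much of that noise is shared and
cancels in the comparison.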

For mature and important applications—for example, advertising, web search, and product
recommendations—I have also seen teams that are highly motivated to eke out even a 0.01%
improvement, since it has a direct impact on the company’s profits. In this case, the dev set
could be much larger than 10,000, in order to detect even smaller improvements.
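Continuing the same back-of-the-envelope reasoning (and the same assumed binomial model with
a 90% base accuracy), you can invert the formula to ask roughly how large a dev set needs to be
before its measurement noise shrinks to the size of the improvement you care about. This is a
deliberately crude estimate that ignores how paired comparisons on a shared dev set are more
sensitive, but it shows why resolving a 0.01% improvement calls for a dev set orders of magnitude
larger than one for 0.1%:

    def dev_set_size_for(target_diff, base_accuracy=0.90):
        # Dev-set size whose standard error roughly matches the
        # improvement you want to resolve (back-of-the-envelope only).
        return round(base_accuracy * (1 - base_accuracy) / target_diff ** 2)

    print(dev_set_size_for(0.001))    # ~90,000 examples for a 0.1% improvement
    print(dev_set_size_for(0.0001))   # ~9,000,000 examples for a 0.01% improvement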


How about the size of the test set? It should be large enough to give high confidence in the
overall performance of your system. One popular heuristic had been to use 30% of your data
for your test set. This works well when you have a modest number of examples—say 100 to
10,000 examples. But in the era of big data where we now have machine learning problems
with sometimes more than a billion examples, the fraction of data allocated to dev/test sets
has been shrinking, even as the absolute number of examples in the dev/test sets has been
growing. There is no need to have excessively large dev/test sets beyond what is needed to
evaluate the performance of your algorithms.
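The shift from percentage-based splits to fixed-size dev/test sets can be sketched directly. The
10,000-example sizes below are illustrative choices, not recommendations from the text, and for
simplicity the sketch draws everything from a single pool of data (the earlier advice that dev and
test sets should reflect the distribution you actually care about still applies):

    import random

    def split_dataset(examples, dev_size=10_000, test_size=10_000, seed=0):
        # Split off fixed-size dev and test sets and keep everything
        # else for training, rather than reserving a fixed percentage.
        examples = list(examples)
        random.Random(seed).shuffle(examples)      # reproducible shuffle
        dev = examples[:dev_size]
        test = examples[dev_size:dev_size + test_size]
        train = examples[dev_size + test_size:]
        return train, dev, test

    # With 1,000,000 examples, dev and test are each only 1% of the data,
    # yet each still holds 10,000 examples.
    train, dev, test = split_dataset(range(1_000_000))
    print(len(train), len(dev), len(test))         # 980000 10000 10000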

² In theory, one could also test if a change to an algorithm makes a statistically significant
difference on the dev set. In practice, most teams don’t bother with this (unless they are
publishing academic research papers), and I usually do not find statistical significance tests
useful for measuring interim progress.

