18 How big should the Eyeball and Blackbox dev sets be?
Your Eyeball dev set should be large enough to give you a sense of your algorithm’s major
error categories. If you are working on a task that humans do well (such as recognizing cats
in images), here are some rough guidelines:
• An Eyeball dev set in which your classifier makes 10 mistakes would be considered very
small. With just 10 errors, it’s hard to accurately estimate the impact of different error
categories. But if you have very little data and cannot afford to put more into the Eyeball
dev set, a small one is still better than nothing and will help with project prioritization.
• If your classifier makes ~20 mistakes on Eyeball dev set examples, you would start to get a
rough sense of the major error sources.
• With ~50 mistakes, you would get a good sense of the major error sources.
• With ~100 mistakes, you would get a very good sense of the major sources of errors. I’ve
seen people manually analyze even more errors—sometimes as many as 500. There is no
harm in this as long as you have enough data.
Say your classifier has a 5% error rate. To make sure you have ~100 misclassified examples in
the Eyeball dev set, the Eyeball dev set would have to have about 2,000 examples (since
0.05 × 2,000 = 100). The lower your classifier’s error rate, the larger your Eyeball dev set
needs to be in order to get a large enough set of errors to analyze.
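The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name eyeball_dev_set_size and its parameters are hypothetical, not from the text, and it assumes the classifier's error rate on the Eyeball dev set matches the rate you measured elsewhere. The result is the size at which you would expect the target number of errors; the actual count on any particular sample will vary.

```python
import math


def eyeball_dev_set_size(error_rate: float, target_errors: int) -> int:
    """Return the smallest dev set size whose expected error count
    reaches target_errors, assuming a fixed error_rate (hypothetical helper)."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    if target_errors < 0:
        raise ValueError("target_errors must be non-negative")
    # Round first so floating-point noise (e.g. 30.000000000000004)
    # does not push the ceiling up by one example.
    return math.ceil(round(target_errors / error_rate, 6))


if __name__ == "__main__":
    # Error counts from the rough guidelines: 10, 20, 50, 100.
    for target in (10, 20, 50, 100):
        size = eyeball_dev_set_size(0.05, target)
        print(f"~{target} errors at 5% error rate: about {size} examples")
```

At a 5% error rate and a target of 100 errors this gives 2,000 examples, matching the calculation in the text. Lowering the error rate to 1% raises the requirement to 10,000 examples for the same target.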
If you are working on a task that even humans cannot do well, then the exercise of examining
an Eyeball dev set will not be as helpful because it is harder to figure out why the algorithm
didn’t classify an example correctly. In this case, you might omit the Eyeball dev set.
We discuss guidelines for such problems in a later chapter.
Page 38 Machine Learning Yearning-Draft Andrew Ng