• Overall accuracy on dev set………………… 98.0% (2.0% overall error)
• Errors due to mislabeled examples……… 0.6% (30% of dev set errors)
• Errors due to other causes…………………… 1.4% (70% of dev set errors)
Here, 30% of your errors are due to the mislabeled dev set images, which adds significant
error to your estimate of accuracy. It is now worthwhile to improve the quality of the labels
in the dev set. Tackling the mislabeled examples will help you figure out whether a
classifier's error is closer to 1.4% or 2.0%, a significant relative difference.
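The decomposition above is just counting. A minimal sketch of the arithmetic, using
hypothetical tallies that mirror the numbers above:

```python
# Hypothetical error-analysis tallies that mirror the numbers above.
dev_set_size = 1000
misclassified = 20       # 2.0% overall dev set error
due_to_mislabels = 6     # misclassified examples whose *label* was wrong

overall_error = misclassified / dev_set_size               # 0.020
mislabel_error = due_to_mislabels / dev_set_size           # 0.006
other_error = overall_error - mislabel_error               # 0.014
share_from_mislabels = due_to_mislabels / misclassified    # 0.30

print(f"{overall_error:.1%} overall error; {mislabel_error:.1%} from "
      f"mislabels ({share_from_mislabels:.0%} of all dev set errors)")
```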
It is not uncommon to start off tolerating some mislabeled dev/test set examples, only later
to change your mind as your system improves so that the fraction of mislabeled examples
grows relative to the total set of errors.
The last chapter explained how you can improve error categories such as Dog, Great Cat,
and Blurry through algorithmic improvements. This chapter has shown that you can work
on the Mislabeled category as well, by improving the data's labels.
Whatever process you apply to fixing dev set labels, remember to apply it to the test set
labels too so that your dev and test sets continue to be drawn from the same distribution.
Fixing your dev and test sets together would prevent the problem we discussed in Chapter 6,
where your team optimizes for dev set performance only to realize later that they are being
judged on a different criterion based on a different test set.
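One simple way to keep the two sets in sync is to record every correction in a single table
and apply it with the same function to both splits. A minimal sketch, assuming a
hypothetical (example_id, features, label) schema and a hypothetical correction table:

```python
# Hypothetical schema: each example is (example_id, features, label).
dev_set = [(0, "img0.jpg", "dog"), (1, "img1.jpg", "cat")]
test_set = [(2, "img2.jpg", "cat"), (3, "img3.jpg", "dog")]

# One shared correction table built during label review: id -> fixed label.
label_fixes = {1: "dog", 3: "cat"}

def apply_label_fixes(dataset, fixes):
    """Relabel a split using the shared correction table. Running the
    same function with the same table on dev and test keeps both sets
    drawn from the same (re-labeled) distribution."""
    return [(ex_id, x, fixes.get(ex_id, y)) for ex_id, x, y in dataset]

dev_fixed = apply_label_fixes(dev_set, label_fixes)
test_fixed = apply_label_fixes(test_set, label_fixes)
```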
If you decide to improve the label quality, consider double-checking both the labels of
examples that your system misclassified and the labels of examples it classified correctly. It
is possible that both the original label and your learning algorithm were wrong on an
example. If you fix only the labels of examples that your system had misclassified, you might
introduce bias into your evaluation. If you have 1,000 dev set examples, and if your classifier
has 98.0% accuracy, it is easier to examine the 20 examples it misclassified than to examine
all 980 examples classified correctly. Because it is easier in practice to check only the
misclassified examples, bias does creep into some dev sets. This bias is acceptable if you are
interested only in developing a product or application, but it would be a problem if you plan
to use the result in an academic research paper or need a completely unbiased measure of
test set accuracy.
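If reviewing all 980 correctly classified examples is too expensive, one compromise is to
review every misclassified example plus a random sample of the correctly classified ones,
then extrapolate. A minimal sketch, assuming a hypothetical review_fn that returns True
when a human reviewer finds the stored label wrong:

```python
import random

def estimate_mislabel_rate(misclassified, correct, review_fn,
                           sample_size=100, seed=0):
    """Estimate the fraction of mislabeled dev set examples without the
    bias that comes from reviewing only the misclassified ones."""
    # The misclassified set is small (20 examples here), so review it all.
    bad_misclassified = sum(review_fn(ex) for ex in misclassified)

    # Reviewing all 980 correct examples is costly, so review a random
    # sample and extrapolate the count to the full set.
    rng = random.Random(seed)
    sample = rng.sample(correct, min(sample_size, len(correct)))
    bad_correct_est = (sum(review_fn(ex) for ex in sample)
                       / len(sample) * len(correct))

    return ((bad_misclassified + bad_correct_est)
            / (len(misclassified) + len(correct)))
```

Because the sample of correctly classified examples is drawn at random, the extrapolated
count is an unbiased estimate, though with only 100 reviews its variance may be noticeable.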