
•   Overall accuracy on dev set…………………… 98.0% (2.0% overall error)
•   Errors due to mislabeled examples……… 0.6% (30% of dev set errors)
•   Errors due to other causes………………………… 1.4% (70% of dev set errors)

Here, 30% of your errors are due to mislabeled dev set images, which adds significant error to your estimates of accuracy. It is now worthwhile to improve the quality of the labels in the dev set.
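As a concrete illustration, here is a minimal sketch in Python of how this breakdown is computed. The counts below are hypothetical, chosen to match the percentages above for a 1,000-example dev set:

    # Hypothetical tallies from an error-analysis spreadsheet.
    num_dev_examples = 1000
    num_errors = 20              # 2.0% overall dev set error
    num_mislabeled_errors = 6    # errors caused by wrong labels, not the model

    overall_error = num_errors / num_dev_examples                # 0.020 -> 2.0%
    mislabeled_error = num_mislabeled_errors / num_dev_examples  # 0.006 -> 0.6%
    other_error = overall_error - mislabeled_error               # 0.014 -> 1.4%

    print(f"Overall error: {overall_error:.1%}")
    print(f"Due to mislabeled examples: {mislabeled_error:.1%} "
          f"({num_mislabeled_errors / num_errors:.0%} of dev set errors)")
    print(f"Due to other causes: {other_error:.1%} "
          f"({(num_errors - num_mislabeled_errors) / num_errors:.0%} of dev set errors)")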

             Tackling the mislabeled examples will help you figure out if a classifier’s error is closer to
             1.4% or 2%—a significant relative difference.

It is not uncommon to start off tolerating some mislabeled dev/test set examples, only to change your mind later as your system improves and the fraction of errors due to mislabeled examples grows relative to the total set of errors.

The last chapter explained how you can improve error categories such as Dog, Great Cat and Blurry through algorithmic improvements. You have learned in this chapter that you can work on the Mislabeled category as well, by improving the data's labels.

Whatever process you apply to fixing dev set labels, remember to apply it to the test set labels too so that your dev and test sets continue to be drawn from the same distribution. Fixing your dev and test sets together would prevent the problem we discussed in Chapter 6, where your team optimizes for dev set performance only to realize later that they are being judged on a different criterion based on a different test set.

If you decide to improve the label quality, consider double-checking both the labels of examples that your system misclassified and the labels of examples it classified correctly. It is possible that both the original label and your learning algorithm were wrong on an example. If you fix only the labels of examples that your system had misclassified, you might introduce bias into your evaluation. If you have 1,000 dev set examples, and if your classifier has 98.0% accuracy, it is easier to examine the 20 examples it misclassified than to examine all 980 examples classified correctly. Because it is easier in practice to check only the misclassified examples, bias does creep into some dev sets. This bias is acceptable if you are interested only in developing a product or application, but it would be a problem if you plan to use the result in an academic research paper or need a completely unbiased measure of test set accuracy.
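For example, here is a minimal sketch in Python of one way to select examples for re-checking, assuming hypothetical labels and predictions lists. It reviews every misclassified example plus a random sample of the correctly classified ones, rather than the misclassified examples alone:

    import random

    def select_examples_to_recheck(labels, predictions, num_correct_to_sample=100, seed=0):
        """Pick dev set indices whose labels a human should re-verify.

        Re-checking only the misclassified examples would bias the evaluation,
        so we also draw a random sample of correctly classified examples.
        """
        misclassified = [i for i, (y, p) in enumerate(zip(labels, predictions)) if y != p]
        correct = [i for i, (y, p) in enumerate(zip(labels, predictions)) if y == p]

        rng = random.Random(seed)
        sampled_correct = rng.sample(correct, min(num_correct_to_sample, len(correct)))

        # With 1,000 examples and 98.0% accuracy: all ~20 misclassified
        # examples plus 100 of the ~980 correctly classified ones.
        return misclassified + sampled_correct

The fraction of label errors you find in the sampled correctly classified examples can then be extrapolated to estimate how many mislabeled examples remain among all the correctly classified ones, giving a less biased picture than checking only the misclassified examples.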













