2. You have overfit to the dev set.

The process of repeatedly evaluating ideas on the dev set causes your algorithm to gradually “overfit” to the dev set. When you are done developing, you will evaluate your system on the test set. If you find that your dev set performance is much better than your test set performance, it is a sign that you have overfit to the dev set. In this case, get a fresh dev set.
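
To make the dev-vs-test comparison described above concrete, here is a minimal sketch in Python. The helper names, the numpy dependency, and the 2-percentage-point threshold are illustrative assumptions, not anything prescribed in the text.

    import numpy as np

    def accuracy(y_true, y_pred):
        # Fraction of examples classified correctly.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        return float(np.mean(y_true == y_pred))

    def dev_overfitting_check(y_dev, pred_dev, y_test, pred_test, gap=0.02):
        # Returns True when dev accuracy exceeds test accuracy by more
        # than `gap`. The default of 0.02 (2 percentage points) is an
        # arbitrary illustrative threshold, not a rule from the text.
        dev_acc = accuracy(y_dev, pred_dev)
        test_acc = accuracy(y_test, pred_test)
        print(f"dev accuracy = {dev_acc:.3f}, test accuracy = {test_acc:.3f}")
        return (dev_acc - test_acc) > gap

If a check like this returns True, treat it as the warning sign described above: collect a fresh dev set rather than continuing to tune against the stale one.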

If you need to track your team’s progress, you can also evaluate your system regularly—say once per week or once per month—on the test set. But do not use the test set to make any decisions regarding the algorithm, including whether to roll back to the previous week’s system. If you do so, you will start to overfit to the test set, and can no longer count on it to give a completely unbiased estimate of your system’s performance (which you would need if you’re publishing research papers, or perhaps using this metric to make important business decisions).

3. The metric is measuring something other than what the project needs to optimize.

Suppose that for your cat application, your metric is classification accuracy. This metric currently ranks classifier A as superior to classifier B. But suppose you try out both algorithms, and find that classifier A is allowing occasional pornographic images to slip through. Even though classifier A is more accurate, the bad impression left by the occasional pornographic image means its performance is unacceptable. What do you do?

Here, the metric fails to recognize that Algorithm B is in fact better than Algorithm A for your product. So, you can no longer trust the metric to pick the best algorithm. It is time to change evaluation metrics. For example, you can change the metric to heavily penalize letting through pornographic images, as sketched below. I would strongly recommend picking a new metric and using the new metric to explicitly define a new goal for the team, rather than proceeding for too long without a trusted metric and reverting to manually choosing among classifiers.
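
One way to encode such a penalty is a weighted error in which each pornographic example carries a large weight. This is a minimal sketch assuming numpy and a per-example is_porn flag; the function name, the flag, and the weight of 100 are hypothetical choices for illustration.

    import numpy as np

    def weighted_error(y_true, y_pred, is_porn, porn_weight=100.0):
        # Misclassification error in which a mistake on a pornographic
        # image counts `porn_weight` times as much as any other mistake.
        # The weight of 100 is an illustrative assumption.
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        weights = np.where(np.asarray(is_porn), porn_weight, 1.0)
        mistakes = (y_true != y_pred).astype(float)
        return float(np.sum(weights * mistakes) / np.sum(weights))

    # Example: classifier A makes fewer mistakes overall, but its one
    # mistake lets the pornographic image through, so its weighted
    # error is far worse than classifier B's.
    y      = np.array([1, 1, 0, 0, 0])      # 1 = cat, 0 = not cat
    porn   = np.array([False, False, False, False, True])
    pred_a = np.array([1, 1, 0, 0, 1])      # 1 mistake, on the porn image
    pred_b = np.array([1, 0, 1, 0, 0])      # 2 mistakes, neither on porn

    print(weighted_error(y, pred_a, porn))  # ~0.96: heavily penalized
    print(weighted_error(y, pred_b, porn))  # ~0.02: preferred under the new metric

Under plain accuracy, classifier A wins (one mistake versus two); under the weighted metric, classifier B wins, which matches the product’s actual priorities.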




It is quite common to change dev/test sets or evaluation metrics during a project. Having an initial dev/test set and metric helps you iterate quickly. If you ever find that the dev/test sets or metric are no longer pointing your team in the right direction, it’s not a big deal! Just change them and make sure your team knows about the new direction.








