Page 68 -
P. 68

34 How to define human-level performance


             Suppose you are working on a medical imaging application that automatically makes

             diagnoses from x-ray images. A typical person with no previous medical background besides
             some basic training achieves 15% error on this task. A junior doctor achieves 10% error. An
             experienced doctor achieves 5% error. And a small team of doctors that discuss and debate
             each image achieves 2% error. Which one of these error rates defines “human-level
             performance”?

             In this case, I would use 2% as the human-level performance proxy for our optimal error

             rate. You can also set 2% as the desired performance level because all three reasons from the
             previous chapter for comparing to human-level performance apply:

             • Ease of obtaining labeled data from human labelers.​ You can get a team of doctors
               to provide labels to you with a 2% error rate.

             • Error analysis can draw on human intuition. ​By discussing images with a team of

               doctors, you can draw on their intuitions.

             • Use human-level performance to estimate the optimal error rate and also set
               achievable “desired error rate.”​ It is reasonable to use 2% error as our estimate of the
               optimal error rate. The optimal error rate could be even lower than 2%, but it cannot be
               higher, since it is possible for a team of doctors to achieve 2% error. In contrast, it is not
               reasonable to use 5% or 10% as an estimate of the optimal error rate, since we know these

               estimates are necessarily too high.

             When it comes to obtaining labeled data, you might not want to discuss every image with an
             entire team of doctors since their time is expensive. Perhaps you can have a single junior
             doctor label the vast majority of cases and bring only the harder cases to more experienced
             doctors or to the team of doctors.


             If your system is currently at 40% error, then it doesn’t matter much whether you use a
             junior doctor (10% error) or an experienced doctor (5% error) to label your data and provide
             intuitions. But if your system is already at 10% error, then defining the human-level
             reference as 2% gives you better tools to keep improving your system.












             Page 68                            Machine Learning Yearning-Draft                       Andrew Ng
   63   64   65   66   67   68   69   70   71   72   73