Page 68 -
P. 68
34 How to define human-level performance
Suppose you are working on a medical imaging application that automatically makes
diagnoses from x-ray images. A typical person with no previous medical background besides
some basic training achieves 15% error on this task. A junior doctor achieves 10% error. An
experienced doctor achieves 5% error. And a small team of doctors that discuss and debate
each image achieves 2% error. Which one of these error rates defines “human-level
performance”?
In this case, I would use 2% as the human-level performance proxy for our optimal error
rate. You can also set 2% as the desired performance level because all three reasons from the
previous chapter for comparing to human-level performance apply:
• Ease of obtaining labeled data from human labelers. You can get a team of doctors
to provide labels to you with a 2% error rate.
• Error analysis can draw on human intuition. By discussing images with a team of
doctors, you can draw on their intuitions.
• Use human-level performance to estimate the optimal error rate and also set
achievable “desired error rate.” It is reasonable to use 2% error as our estimate of the
optimal error rate. The optimal error rate could be even lower than 2%, but it cannot be
higher, since it is possible for a team of doctors to achieve 2% error. In contrast, it is not
reasonable to use 5% or 10% as an estimate of the optimal error rate, since we know these
estimates are necessarily too high.
When it comes to obtaining labeled data, you might not want to discuss every image with an
entire team of doctors since their time is expensive. Perhaps you can have a single junior
doctor label the vast majority of cases and bring only the harder cases to more experienced
doctors or to the team of doctors.
If your system is currently at 40% error, then it doesn’t matter much whether you use a
junior doctor (10% error) or an experienced doctor (5% error) to label your data and provide
intuitions. But if your system is already at 10% error, then defining the human-level
reference as 2% gives you better tools to keep improving your system.
Page 68 Machine Learning Yearning-Draft Andrew Ng