score,” which is a modified way of computing their average, and works better than simply
taking the mean. 4
Classifier   Precision   Recall   F1 score
A            95%         90%      92.4%
B            98%         85%      91.0%
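To make the arithmetic concrete, here is a minimal Python sketch that reproduces the F1 values in the table from the Precision and Recall columns. The helper name f1_score is just illustrative and not from the text; it simply computes the harmonic mean described in the footnote.

# Sketch: F1 as the harmonic mean of Precision and Recall.
def f1_score(precision, recall):
    # 2 / ((1/precision) + (1/recall)) simplifies to the expression below.
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.95, 0.90))  # Classifier A: ~0.924 (92.4%)
print(f1_score(0.98, 0.85))  # Classifier B: ~0.910 (91.0%)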
Having a single-number evaluation metric speeds up your ability to make a decision when
you are selecting among a large number of classifiers. It gives a clear preference ranking
among all of them, and therefore a clear direction for progress.
As a final example, suppose you are separately tracking the accuracy of your cat classifier in
four key markets: (i) US, (ii) China, (iii) India, and (iv) Other. This gives four metrics. By
taking an average or weighted average of these four numbers, you end up with a single-number metric. Taking an average or weighted average is one of the most common ways to
combine multiple metrics into one.
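To illustrate this kind of combination, here is a minimal Python sketch. The accuracy figures and market weights below are purely hypothetical placeholders, not numbers from the text; in practice you might weight each market by its share of your users.

# Sketch: combining four per-market accuracies into one number.
accuracies = {"US": 0.97, "China": 0.95, "India": 0.93, "Other": 0.96}  # hypothetical
weights    = {"US": 0.40, "China": 0.30, "India": 0.20, "Other": 0.10}  # hypothetical

simple_average   = sum(accuracies.values()) / len(accuracies)
weighted_average = sum(weights[m] * accuracies[m] for m in accuracies)

print(simple_average)    # plain mean of the four accuracies
print(weighted_average)  # mean weighted by market importance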
4 If you want to learn more about the F1 score, see https://en.wikipedia.org/wiki/F1_score. It is the
“harmonic mean” between Precision and Recall, and is calculated as 2/((1/Precision)+(1/Recall)).