
8 Establish a single-number evaluation metric for your team to optimize




Classification accuracy is an example of a single-number evaluation metric: You run your classifier on the dev set (or test set), and get back a single number about what fraction of examples it classified correctly. According to this metric, if classifier A obtains 97% accuracy, and classifier B obtains 90% accuracy, then we judge classifier A to be superior.
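
To make this concrete, here is a minimal sketch of computing accuracy; the label lists are hypothetical dev-set data, not from the book:

# Accuracy as a single-number metric: one number summarizes the whole dev set.
# y_true / y_pred are hypothetical dev-set labels and predictions (1 = cat, 0 = not cat).
def accuracy(y_true, y_pred):
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
print(f"accuracy = {accuracy(y_true, y_pred):.0%}")  # prints: accuracy = 80%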

In contrast, Precision and Recall³ do not form a single-number evaluation metric: they give two numbers for assessing your classifier. Having multiple-number evaluation metrics makes it harder to compare algorithms. Suppose your algorithms perform as follows:



Classifier   Precision   Recall
A            95%         90%
B            98%         85%

Here, neither classifier is obviously superior, so the two numbers don't immediately guide you toward picking one.
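
Following the definitions in footnote 3, a minimal sketch of computing both numbers (again on hypothetical labels, with 1 meaning "cat"):

# Precision: of the images labeled "cat", what fraction really are cats?
# Recall: of the images that really are cats, what fraction did we label "cat"?
def precision_recall(y_true, y_pred):
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    precision = true_pos / sum(p == 1 for p in y_pred)
    recall = true_pos / sum(t == 1 for t in y_true)
    return precision, recall

y_true = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
p, r = precision_recall(y_true, y_pred)
print(f"precision = {p:.0%}, recall = {r:.0%}")  # precision = 83%, recall = 83%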


During development, your team will try a lot of ideas about algorithm architecture, model parameters, choice of features, etc. Having a single-number evaluation metric such as accuracy allows you to sort all your models according to their performance on this metric, and quickly decide what is working best.
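
For instance, once every experiment reports the same single metric, ranking them takes one line of sorting (the model names and scores below are made up for illustration):

# Hypothetical experiment log: model name -> dev-set accuracy.
experiments = {"baseline": 0.90, "bigger_model": 0.97, "more_features": 0.93}

# Sort all models by the one metric, best first, to see what is working.
for name, acc in sorted(experiments.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.0%}")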


If you really care about both Precision and Recall, I recommend using one of the standard ways to combine them into a single number. For example, one could take the average of Precision and Recall, to end up with a single number. Alternatively, you can compute the "F1 score," a modified way of averaging Precision and Recall: it is their harmonic mean, 2/((1/Precision)+(1/Recall)), which works better than simply taking the mean.

Classifier   Precision   Recall   F1 score
A            95%         90%      92.4%
B            98%         85%      91.0%

By this metric, classifier A (92.4% F1) is superior to classifier B (91.0% F1), so the single number immediately guides your choice.
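
A minimal sketch of the F1 computation, reproducing the numbers in the table above:

def f1(precision, recall):
    # Harmonic mean of Precision and Recall: 2 / (1/P + 1/R).
    return 2 * precision * recall / (precision + recall)

print(f"A: F1 = {f1(0.95, 0.90):.1%}")  # A: F1 = 92.4%
print(f"B: F1 = {f1(0.98, 0.85):.1%}")  # B: F1 = 91.0%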





³ The Precision of a cat classifier is the fraction of images in the dev (or test) set it labeled as cats that really are cats. Its Recall is the percentage of all cat images in the dev (or test) set that it correctly labeled as a cat. There is often a tradeoff between having high precision and high recall.

