Page 81 -
P. 81

42 Addressing data mismatch


             Suppose you have developed a speech recognition system that does very well on the training
             set and on the training dev set. However, it does poorly on your dev set: You have a data

             mismatch problem. What can you do?

             I recommend that you: (i) Try to understand what properties of the data differ between the
             training and the dev set distributions. (ii) Try to find more training data that better matches
                                                                           14
             the dev set examples that your algorithm has trouble with.
             For example, suppose you carry out an error analysis on the speech recognition dev set: You

             manually go through 100 examples, and try to understand where the algorithm is making
             mistakes. You find that your system does poorly because most of the audio clips in the dev
             set are taken within a car, whereas most of the training examples were recorded against a
             quiet background. The engine and road noise dramatically worsen the performance of your
             speech system. In this case, you might try to acquire more training data comprising audio
             clips that were taken in a car. The purpose of the error analysis is to understand the

             significant differences between the training and the dev set, which is what leads to the data
             mismatch.

             If your training and training dev sets include audio recorded within a car, you should also
             double-check your system’s performance on this subset of data. If it is doing well on the car
             data in the training set but not on car data in the training dev set, then this further validates
             the hypothesis that getting more car data would help. This is why we discussed the

             possibility of including in your training set some data drawn from the same distribution as
             your dev/test set in the previous chapter. Doing so allows you to compare your performance
             on the car data in the training set vs. the dev/test set.

             Unfortunately, there are no guarantees in this process. For example, if you don't have any
             way to get more training data that better match the dev set data, you might not have a clear

             path towards improving performance.









             14 There is also some research on “domain adaptation”—how to train an algorithm on one distribution
             and have it generalize to a different distribution. These methods are typically applicable only in
             special types of problems and are much less widely used than the ideas described in this chapter.


             Page 81                            Machine Learning Yearning-Draft                       Andrew Ng
   76   77   78   79   80   81   82   83   84   85   86