42 Addressing data mismatch
Suppose you have developed a speech recognition system that does very well on the training
set and on the training dev set. However, it does poorly on your dev set: You have a data
mismatch problem. What can you do?
I recommend that you: (i) Try to understand what properties of the data differ between the
training and the dev set distributions. (ii) Try to find more training data that better matches
the dev set examples that your algorithm has trouble with.14
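For example, if each clip carries a metadata tag describing its recording environment, a quick comparison of how that tag is distributed in the training set versus the dev set can surface the kind of property difference recommendation (i) is looking for. The sketch below is only illustrative: the `environment` field and the toy data are assumptions, not part of any particular dataset format.

```python
from collections import Counter

def environment_fractions(clips):
    """Fraction of clips per recording environment (hypothetical metadata tag)."""
    counts = Counter(clip["environment"] for clip in clips)
    total = sum(counts.values())
    return {env: n / total for env, n in counts.items()}

# Toy stand-ins for the real training and dev sets.
train_clips = [{"environment": "quiet_room"}] * 90 + [{"environment": "car"}] * 10
dev_clips = [{"environment": "car"}] * 70 + [{"environment": "quiet_room"}] * 30

train_frac = environment_fractions(train_clips)
dev_frac = environment_fractions(dev_clips)
for env in sorted(set(train_frac) | set(dev_frac)):
    print(f"{env}: train {train_frac.get(env, 0.0):.0%} vs dev {dev_frac.get(env, 0.0):.0%}")
```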
For example, suppose you carry out an error analysis on the speech recognition dev set: You
manually go through 100 examples, and try to understand where the algorithm is making
mistakes. You find that your system does poorly because most of the audio clips in the dev
set are taken within a car, whereas most of the training examples were recorded against a
quiet background. The engine and road noise dramatically worsen the performance of your
speech system. In this case, you might try to acquire more training data comprising audio
clips that were taken in a car. The purpose of the error analysis is to understand the
significant differences between the training and dev sets, since those differences are what
lead to the data mismatch.
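Here is a minimal sketch of tallying such an error analysis, assuming you jot down one short note per misrecognized clip as you review the 100 dev set examples; the category names and counts below are invented for illustration.

```python
from collections import Counter

# Hypothetical notes, one per misrecognized dev set clip reviewed by hand.
error_notes = [
    "car noise", "car noise", "speaker far from mic", "car noise",
    "mumbled speech", "car noise", "background music", "car noise",
    # ... one note per reviewed clip, 100 in total
]

for category, count in Counter(error_notes).most_common():
    print(f"{category}: {count}/{len(error_notes)} "
          f"({100 * count / len(error_notes):.0f}%)")
```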
If your training and training dev sets include audio recorded within a car, you should also
double-check your system’s performance on this subset of data. If it is doing well on the car
data in the training set but not on car data in the training dev set, then this further validates
the hypothesis that getting more car data would help. This is why, in the previous chapter,
we discussed the possibility of including in your training set some data drawn from the same
distribution as your dev/test set. Doing so allows you to compare your performance
on the car data in the training set vs. the dev/test set.
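Here is a minimal sketch of that comparison. It assumes each example carries an `is_car_audio` flag and that you already have an `evaluate` function that scores the model on a list of examples; both names are placeholders rather than part of any particular library.

```python
def error_rate_on_car_subset(examples, evaluate):
    """Score the model on only the in-car clips of one data split.

    `examples` is a list of dicts carrying an "is_car_audio" flag, and
    `evaluate` is whatever function you already use to compute the model's
    error rate on a list of examples; both names are placeholders.
    """
    car_clips = [ex for ex in examples if ex["is_car_audio"]]
    return evaluate(car_clips), len(car_clips)

# Usage sketch (train, train_dev, dev, and evaluate are assumed to exist):
# for name, split in [("train", train), ("train_dev", train_dev), ("dev", dev)]:
#     err, n = error_rate_on_car_subset(split, evaluate)
#     print(f"{name}: error on {n} in-car clips = {err:.1%}")
```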
Unfortunately, there are no guarantees in this process. For example, if you don't have any
way to get more training data that better matches the dev set data, you might not have a clear
path towards improving performance.
14 There is also some research on “domain adaptation”—how to train an algorithm on one distribution
and have it generalize to a different distribution. These methods are typically applicable only in
special types of problems and are much less widely used than the ideas described in this chapter.