But in the era of big data, we now have access to huge training sets, such as internet images
of cats. Even if the training set comes from a different distribution than the dev/test set, we
still want to use it for learning since it can provide a lot of information.
For the cat detector example, instead of putting all 10,000 user-uploaded images into the
dev/test sets, we might put only 5,000 into the dev/test sets. We can put the remaining
5,000 user-uploaded examples into the training set. This way, the training set of 205,000
examples contains some data that comes from the dev/test distribution along with the
200,000 internet images. We will discuss in a later chapter why this method is helpful.
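The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name split_mixed and the placeholder tuples are hypothetical, and dividing the 5,000 held-out images equally between dev and test is an assumption, since the text does not say how they are divided.

```python
import random

def split_mixed(internet, user, n_eval, seed=0):
    """Hold out n_eval user examples for dev/test; train on everything else.

    Assumption: the held-out examples are divided equally between dev and test.
    """
    rng = random.Random(seed)
    user = list(user)
    rng.shuffle(user)  # avoid any ordering bias in the user uploads

    eval_part, user_train = user[:n_eval], user[n_eval:]
    half = n_eval // 2
    dev, test = eval_part[:half], eval_part[half:]

    # The training set mixes internet data with the remaining user data.
    train = list(internet) + user_train
    rng.shuffle(train)
    return train, dev, test

# Placeholder records standing in for real images (hypothetical data).
internet_images = [("internet", i) for i in range(200_000)]
user_images = [("user", i) for i in range(10_000)]

train, dev, test = split_mixed(internet_images, user_images, n_eval=5_000)
print(len(train), len(dev), len(test))  # 205000 2500 2500
```

Note that the dev and test sets contain only user-uploaded images, so they still reflect the distribution we care about.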
Let’s consider a second example. Suppose you are building a speech recognition system to
transcribe street addresses for a voice-controlled mobile map/navigation app. You have
20,000 examples of users speaking street addresses. You also have 500,000 other audio
clips of users speaking about other topics. You might take 10,000 examples of
street addresses for the dev/test sets, and use the remaining 10,000, plus the additional
500,000 examples, for training.
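To make the proportions in the two examples concrete, the short calculation below computes what share of each training set comes from the dev/test distribution. It is plain arithmetic on the numbers given in the text; the helper name target_fraction is hypothetical.

```python
def target_fraction(n_target_in_train, n_other_in_train):
    """Share of the training set drawn from the dev/test distribution."""
    return n_target_in_train / (n_target_in_train + n_other_in_train)

# Cat detector: 5,000 user-uploaded images plus 200,000 internet images.
cat = target_fraction(5_000, 200_000)

# Speech recognition: 10,000 street-address clips plus 500,000 other clips.
speech = target_fraction(10_000, 500_000)

print(f"cat detector: {cat:.1%}")          # cat detector: 2.4%
print(f"speech recognition: {speech:.1%}")  # speech recognition: 2.0%
```

In both cases only about 2% of the training data comes from the distribution we ultimately care about.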
We will continue to assume that your dev data and your test data come from the same
distribution. But it is important to understand that having different training and dev/test
distributions poses some special challenges.
Page 72 Machine Learning Yearning-Draft Andrew Ng