
To take one more example, suppose you are building a computer vision system to recognize cars. Suppose you partner with a video gaming company, which has computer graphics models of several cars. To train your algorithm, you use these models to generate synthetic images of cars. Even if the synthesized images look very realistic, this approach (which has been independently proposed by many people) will probably not work well. The entire video game might contain only ~20 car designs, because building a 3D model of a car is very expensive. If you were playing the game, you probably wouldn't notice that you're seeing the same cars over and over, perhaps only painted differently; that is, the data looks very realistic to you. But compared to the set of all cars out on the roads, and therefore to what you're likely to see in the dev/test sets, these 20 synthesized cars capture only a minuscule fraction of the world's distribution of cars. Thus, if your 100,000 training examples all come from these 20 cars, your system will "overfit" to these 20 specific car designs, and it will fail to generalize well to dev/test sets that include other car designs.


When synthesizing data, put some thought into whether you're really synthesizing a representative set of examples. Try to avoid giving the synthesized data properties that make it possible for a learning algorithm to distinguish synthesized from non-synthesized examples, such as having all the synthesized data come from one of 20 car designs, or all the synthesized audio come from only 1 hour of car noise. This advice can be hard to follow. When working on data synthesis, my teams have sometimes taken weeks before we produced data with details close enough to the actual distribution for the synthesized data to have a significant effect. But if you are able to get the details right, you can suddenly access a far larger training set than before.
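
One rough way to check whether your synthesized examples give themselves away is to train a simple "real vs. synthetic" classifier: if it does much better than chance, the synthetic data has tell-tale properties that your main model may also latch onto (such as seeing only 20 car designs). The sketch below is a minimal illustration of this idea, not a procedure from this chapter; the function name and the placeholder features are assumptions, and in practice you would feed in flattened pixels or embeddings from a pretrained network.

# Minimal sketch (assumed helper, not from the text): measure how easily a
# simple classifier can tell real examples from synthesized ones.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def synthetic_vs_real_check(real_features, synth_features, cv_folds=5):
    """Return mean cross-validated accuracy of a real-vs-synthetic classifier.

    real_features, synth_features: 2D arrays of shape (num_examples, num_features),
    e.g. flattened pixels or pretrained-network embeddings (your choice).
    Accuracy near 0.5 suggests the synthetic data is hard to tell apart;
    accuracy near 1.0 suggests it has obvious artifacts.
    """
    X = np.vstack([real_features, synth_features])
    # Label real examples 0 and synthesized examples 1.
    y = np.concatenate([np.zeros(len(real_features)), np.ones(len(synth_features))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=cv_folds, scoring="accuracy").mean()

# Example usage with random placeholder features (replace with your own data):
real = np.random.randn(500, 128)
synth = np.random.randn(500, 128) + 0.5  # shifted distribution, so easy to detect
print("real-vs-synthetic accuracy:", synthetic_vs_real_check(real, synth))

If this check reports high accuracy, it is a hint that the synthesized data is not yet representative enough, and that more effort on the details of the synthesis is needed before it will help your main model.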








