
To take one more example, suppose you are building a computer vision system to recognize cars. Suppose you partner with a video gaming company, which has computer graphics models of several cars. To train your algorithm, you use these models to generate synthetic images of cars. Even if the synthesized images look very realistic, this approach (which has been independently proposed by many people) will probably not work well. The entire video game might contain only ~20 car designs, because building a 3D model of a car is very expensive. If you were playing the game, you probably wouldn't notice that you're seeing the same cars over and over, perhaps only painted differently; that is, the data looks very realistic to you. But compared to the set of all cars out on the roads, and therefore to what you're likely to see in the dev/test sets, these 20 synthesized cars capture only a minuscule fraction of the world's distribution of cars. Thus, if your 100,000 training examples all come from these 20 cars, your system will "overfit" to these 20 specific car designs, and it will fail to generalize well to dev/test sets that include other car designs.


When synthesizing data, put some thought into whether you're really synthesizing a representative set of examples. Try to avoid giving the synthesized data properties that make it possible for a learning algorithm to distinguish synthesized from non-synthesized examples, such as having all the synthesized data come from one of 20 car designs, or all the synthesized audio come from only 1 hour of car noise. This advice can be hard to follow. When working on data synthesis, my teams have sometimes taken weeks before we produced data with details close enough to the actual distribution for the synthesized data to have a significant effect. But if you are able to get the details right, you can suddenly access a far larger training set than before.
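
One rough way to check whether your synthesized examples give themselves away is to train a simple "real vs. synthetic" classifier: if it does much better than chance, the synthetic data has tell-tale properties that your main model may also latch onto (such as seeing only 20 car designs). The sketch below is a minimal illustration of this idea, not a procedure from this chapter; the function name and the placeholder features are assumptions, and in practice you would feed in flattened pixels or embeddings from a pretrained network.

# Minimal sketch (assumed helper, not from the text): measure how easily a
# simple classifier can tell real examples from synthesized ones.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def synthetic_vs_real_check(real_features, synth_features, cv_folds=5):
    """Return mean cross-validated accuracy of a real-vs-synthetic classifier.

    real_features, synth_features: 2D arrays of shape (num_examples, num_features),
    e.g. flattened pixels or pretrained-network embeddings (your choice).
    Accuracy near 0.5 suggests the synthetic data is hard to tell apart;
    accuracy near 1.0 suggests it has obvious artifacts.
    """
    X = np.vstack([real_features, synth_features])
    # Label real examples 0 and synthesized examples 1.
    y = np.concatenate([np.zeros(len(real_features)), np.ones(len(synth_features))])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=cv_folds, scoring="accuracy").mean()

# Example usage with random placeholder features (replace with your own data):
real = np.random.randn(500, 128)
synth = np.random.randn(500, 128) + 0.5  # shifted distribution, so easy to detect
print("real-vs-synthetic accuracy:", synthetic_vs_real_check(real, synth))

If this check reports high accuracy, it is a hint that the synthesized data is not yet representative enough, and that more effort on the details of the synthesis is needed before it will help your main model.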








