Page 75 -
P. 75

38 How to decide whether to include

             inconsistent data



             Suppose you want to learn to predict housing prices in New York City. Given the size of a
             house (input feature x), you want to predict the price (target label y).

             Housing prices in New York City are very high. Suppose you have a second dataset of
             housing prices in Detroit, Michigan, where housing prices are much lower. Should you

             include this data in your training set?

             Given the same size x, the price of a house y is very different depending on whether it is in
             New York City or in Detroit. If you only care about predicting New York City housing prices,
             putting the two datasets together will hurt your performance.  In this case, it would be better
                                                         13
             to leave out the inconsistent Detroit data.

             How is this New York City vs. Detroit example different from the  mobile app vs. internet cat
             images example?

             The cat image example is different because, given an input picture x, one can reliably predict
             the label y indicating whether there is a cat, even without knowing if the image is an internet
             image or a mobile app image. I.e., there is a function f(x) that reliably maps from the input x

             to the target output y, even without knowing the origin of x. Thus, the task of recognition
             from internet images is “consistent” with the task of recognition from mobile app images.
             This means there was little downside (other than computational cost) to including all the
             data, and some possible significant upside. In contrast, New York City and Detroit, Michigan
             data are not consistent. Given the same x (size of house), the price is very different
             depending on where the house is.














             13  There is one way to address the problem of Detroit data being inconsistent with New York City
             data, which is to add an extra feature to each training example indicating the city. Given an input
             x—which now specifies the city—the target value of y is now unambiguous. However, in practice I do
             not see this done frequently.


             Page 75                            Machine Learning Yearning-Draft                       Andrew Ng
   70   71   72   73   74   75   76   77   78   79   80