38 How to decide whether to include inconsistent data
Suppose you want to learn to predict housing prices in New York City. Given the size of a
house (input feature x), you want to predict the price (target label y).
Housing prices in New York City are very high. Suppose you have a second dataset of
housing prices in Detroit, Michigan, where housing prices are much lower. Should you
include this data in your training set?
Given the same size x, the price of a house y is very different depending on whether it is in
New York City or in Detroit. If you only care about predicting New York City housing prices,
putting the two datasets together will hurt your performance. In this case, it would be better
to leave out the inconsistent Detroit data.[13]
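To make the effect concrete, here is a minimal sketch, assuming a plain linear model of price against size and entirely invented numbers (the base prices, per-square-foot prices, and noise levels below are hypothetical, not real market data). It fits one model on the New York City data alone and another on the pooled data, then measures both on held-out New York City examples.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_city(n, base, price_per_sqft, noise):
    # Synthetic (size, price) pairs. All numbers are invented for illustration.
    size = rng.uniform(500.0, 3000.0, n)
    price = base + price_per_sqft * size + rng.normal(0.0, noise, n)
    return size, price

def fit_line(size, price):
    # Least-squares fit of price = slope * size + intercept.
    X = np.column_stack([size, np.ones_like(size)])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    return coef  # [slope, intercept]

def mean_abs_error(coef, size, price):
    # Average absolute gap between predicted and actual price.
    return float(np.mean(np.abs(coef[0] * size + coef[1] - price)))

# Hypothetical cities: same range of sizes x, very different prices y.
nyc_size, nyc_price = make_city(200, 100_000.0, 1000.0, 50_000.0)
det_size, det_price = make_city(200, 20_000.0, 100.0, 10_000.0)
nyc_test_size, nyc_test_price = make_city(200, 100_000.0, 1000.0, 50_000.0)

nyc_only = fit_line(nyc_size, nyc_price)
pooled = fit_line(
    np.concatenate([nyc_size, det_size]),
    np.concatenate([nyc_price, det_price]),
)

err_nyc_only = mean_abs_error(nyc_only, nyc_test_size, nyc_test_price)
err_pooled = mean_abs_error(pooled, nyc_test_size, nyc_test_price)
print(f"NYC-only model, error on NYC test set: {err_nyc_only:,.0f}")
print(f"Pooled model, error on NYC test set:   {err_pooled:,.0f}")
```

Because the pooled model has to compromise between two very different price levels for the same x, its error on New York City houses comes out far larger than that of the model trained on New York City data alone.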
How is this New York City vs. Detroit example different from the mobile app vs. internet cat
images example?
The cat image example is different because, given an input picture x, one can reliably predict
the label y indicating whether there is a cat, even without knowing if the image is an internet
image or a mobile app image. I.e., there is a function f(x) that reliably maps from the input x
to the target output y, even without knowing the origin of x. Thus, the task of recognition
from internet images is “consistent” with the task of recognition from mobile app images.
This means there was little downside (other than computational cost) to including all the
data, and a potentially significant upside. In contrast, New York City and Detroit, Michigan
data are not consistent. Given the same x (size of house), the price is very different
depending on where the house is.
[13] One way to address the inconsistency between the Detroit data and the New York City data is
to add an extra feature to each training example indicating the city. Given an input x that now
specifies the city, the target value y is unambiguous. However, in practice I do not see this
done frequently.
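The city-feature idea in this footnote could be sketched as follows, again with invented numbers and a linear model as assumptions. The variable `is_nyc` is a hypothetical indicator column. The `size * is_nyc` interaction column is my addition: a linear model can only shift its intercept with a bare indicator, so the interaction is what lets the price per unit of size differ by city. A more flexible model might learn this from the indicator alone.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # examples per city

# Invented data: first n rows are New York City, last n rows are Detroit.
size = rng.uniform(500.0, 3000.0, 2 * n)
is_nyc = np.concatenate([np.ones(n), np.zeros(n)])
price = np.where(is_nyc == 1.0,
                 100_000.0 + 1000.0 * size,   # hypothetical NYC pricing
                 20_000.0 + 100.0 * size)     # hypothetical Detroit pricing
price = price + rng.normal(0.0, 10_000.0, 2 * n)

ones = np.ones_like(size)
# Input x without the city: the same size maps to two very different prices.
X_size_only = np.column_stack([size, ones])
# Input x with the city: indicator plus a size-by-city interaction term.
X_with_city = np.column_stack([size, is_nyc, size * is_nyc, ones])

def fit_and_score(X, y, mask):
    # Fit by least squares on all rows, then report mean absolute error
    # on the rows selected by mask (training-set error, for simplicity).
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean(np.abs(X[mask] @ coef - y[mask])))

nyc = is_nyc == 1.0
err_size_only = fit_and_score(X_size_only, price, nyc)
err_with_city = fit_and_score(X_with_city, price, nyc)
print(f"Size only, error on NYC rows:      {err_size_only:,.0f}")
print(f"Size plus city, error on NYC rows: {err_with_city:,.0f}")
```

Once the city is part of x, a single model trained on both datasets can fit the New York City houses about as well as the noise allows, which is the sense in which y becomes unambiguous.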
Page 75 Machine Learning Yearning-Draft Andrew Ng