32 Plotting learning curves
Suppose you have a very small training set of 100 examples. You train your algorithm using a
randomly chosen subset of 10 examples, then 20 examples, then 30, and so on up to 100,
increasing the number of examples in steps of ten. You then use these 10 data points to plot
your learning curve. You might find that the curve looks slightly noisy (meaning that the
values are higher or lower than expected) at the smaller training set sizes.
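As a minimal sketch of this procedure, the Python below trains on subsets of 10, 20, ..., 100 examples and records the training and dev error at each size. The synthetic data (`make_example`) and the toy nearest-class-mean classifier (`train_centroid`) are hypothetical stand-ins, not part of the original text; substitute your own data and learning algorithm.

```python
import random

def make_example(rng):
    """Hypothetical synthetic example: one feature x and a binary label y."""
    y = 1 if rng.random() < 0.5 else 0
    x = rng.gauss(1.0 if y == 1 else -1.0, 1.0)
    return x, y

def train_centroid(train):
    """Toy stand-in for a learning algorithm: mean feature value per class."""
    model = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        if xs:  # a tiny training set may be missing a class entirely
            model[label] = sum(xs) / len(xs)
    return model

def predict(model, x):
    # Predict the class whose mean is closest to x.
    return min(model, key=lambda label: abs(x - model[label]))

def error_rate(model, examples):
    wrong = sum(1 for x, y in examples if predict(model, x) != y)
    return wrong / len(examples)

def learning_curve(train_pool, dev_set, sizes, rng):
    """Return one (size, training error, dev error) point per training set size."""
    points = []
    for m in sizes:
        subset = rng.sample(train_pool, m)  # m randomly chosen examples
        model = train_centroid(subset)
        points.append((m, error_rate(model, subset), error_rate(model, dev_set)))
    return points

rng = random.Random(0)  # fixed seed so the run is repeatable
train_pool = [make_example(rng) for _ in range(100)]
dev_set = [make_example(rng) for _ in range(500)]
sizes = list(range(10, 101, 10))  # 10, 20, ..., 100

curve = learning_curve(train_pool, dev_set, sizes, rng)
for m, train_err, dev_err in curve:
    print(f"m={m:3d}  train error={train_err:.2f}  dev error={dev_err:.2f}")
```

Changing the seed and rerunning gives a feel for how much the points at the smallest sizes move around from one random subset to the next.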
When training on just 10 randomly chosen examples, you might be unlucky and have a
particularly “bad” training set, such as one with many ambiguous/mislabeled examples. Or,
you might get lucky and get a particularly “good” training set. Having a small training set
means that the dev and training errors may randomly fluctuate.
If your machine learning application is heavily skewed toward one class (such as a cat
classification task where the fraction of negative examples is much larger than the fraction
of positive examples), or if it has a huge number of classes (such as recognizing 100 different animal
species), then the chance of selecting an especially “unrepresentative” or bad training set is
also larger. For example, if 80% of your examples are negative examples (y=0), and only
20% are positive examples (y=1), then there is a chance that a training set of 10 examples
contains only negative examples, thus making it very difficult for the algorithm to learn
something meaningful.
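To put a rough number on that chance, here is a short calculation. It assumes two simple models that the text does not spell out: each example being negative independently with probability 0.8, or 10 distinct examples drawn from a pool of exactly 80 negatives and 20 positives. The variable names are illustrative only.

```python
from math import comb

p_negative = 0.8  # fraction of negative examples (y=0)
m = 10            # training set size

# Assumption 1: each of the 10 examples is negative independently.
p_all_negative_independent = p_negative ** m

# Assumption 2: 10 distinct examples are drawn from a pool of 100
# that contains exactly 80 negatives and 20 positives.
p_all_negative_from_pool = comb(80, m) / comb(100, m)

print(f"independent draws: {p_all_negative_independent:.3f}")
print(f"drawn from pool:   {p_all_negative_from_pool:.3f}")
```

Under either assumption the chance is roughly one in ten, so an all-negative training set of 10 examples is not a rare event.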
If the noise in the training curve makes it hard to see the true trends, here are two solutions:
• Instead of training just one model on 10 examples, select several (say 3-10)
different randomly chosen training sets of 10 examples by sampling with replacement [10]
from your original set of 100. Train a different model on each of these, and compute the
training and dev set error of each of the resulting models. Compute and plot the average
training error and average dev set error.
• If your training set is skewed towards one class, or if it has many classes, choose a
“balanced” subset instead of 10 training examples at random out of the set of 100. For
example, you can make sure that 2/10 of the examples are positive examples, and 8/10 are
[10] Here’s what sampling with replacement means: You would randomly pick 10 different examples out of the 100 to form
your first training set. Then to form the second training set, you would again pick 10 examples, but without taking into
account what had been chosen in the first training set. Thus, it is possible for one specific example to appear in both the
first and second training sets. In contrast, if you were sampling without replacement, the second training set would be
chosen from just the 90 examples that had not been chosen the first time around. In practice, sampling with or without
replacement shouldn’t make a huge difference, but the former is common practice.
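The two remedies above, and the with/without replacement distinction in this footnote, can be sketched as follows. This is an illustration under stated assumptions rather than a prescribed implementation: the helper names, the (id, label) example format, and the majority-class "model" are all hypothetical stand-ins for your own data and learning algorithm.

```python
import random
from statistics import mean

def training_sets_with_replacement(pool, set_size, n_sets, rng):
    """Each set is drawn independently from the full pool, so sets may overlap."""
    return [rng.sample(pool, set_size) for _ in range(n_sets)]

def training_sets_without_replacement(pool, set_size, n_sets, rng):
    """Later sets only use examples not chosen earlier, so sets never overlap."""
    if set_size * n_sets > len(pool):
        raise ValueError("not enough examples for disjoint training sets")
    shuffled = rng.sample(pool, len(pool))
    return [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]

def balanced_subset(pool, n_pos, n_neg, rng):
    """Pick a subset with a fixed number of positive and negative examples."""
    pos = [ex for ex in pool if ex[1] == 1]
    neg = [ex for ex in pool if ex[1] == 0]
    return rng.sample(pos, n_pos) + rng.sample(neg, n_neg)

def average_errors(training_sets, dev_set, train_fn, error_fn):
    """Train one model per training set; return average training and dev error."""
    train_errs, dev_errs = [], []
    for ts in training_sets:
        model = train_fn(ts)
        train_errs.append(error_fn(model, ts))
        dev_errs.append(error_fn(model, dev_set))
    return mean(train_errs), mean(dev_errs)

# Hypothetical stand-in learner: always predict the majority class.
def train_majority(train):
    positives = sum(y for _, y in train)
    return 1 if positives * 2 > len(train) else 0

def error_rate(model, examples):
    return sum(1 for _, y in examples if y != model) / len(examples)

rng = random.Random(0)
pool = [(i, 1 if i < 20 else 0) for i in range(100)]        # 20% positive
dev_set = [(i, 1 if i % 5 == 0 else 0) for i in range(50)]  # 20% positive

sets_with = training_sets_with_replacement(pool, 10, 5, rng)
sets_without = training_sets_without_replacement(pool, 10, 5, rng)
avg_train, avg_dev = average_errors(sets_with, dev_set, train_majority, error_rate)
subset = balanced_subset(pool, 2, 8, rng)  # 2 of the 10 examples are positive

print(f"average training error={avg_train:.2f}  average dev error={avg_dev:.2f}")
```

To build an averaged learning curve, you would repeat the `average_errors` call for each training set size and plot the two averages against the size.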
Page 63 Machine Learning Yearning-Draft Andrew Ng