Given a reward function R(T), the job of the reinforcement learning algorithm is to control the helicopter so that it achieves max_T R(T). However, reinforcement learning algorithms make many approximations and may not succeed in achieving this maximization.
Suppose you have picked some reward R(.) and have run your learning algorithm. However,
its performance appears far worse than your human pilot—the landings are bumpier and
seem less safe than what a human pilot achieves. How can you tell if the fault is with the
reinforcement learning algorithm, which is trying to carry out a trajectory that achieves max_T R(T), or if the fault is with the reward function, which is trying to measure as well as specify the ideal tradeoff between ride bumpiness and accuracy of landing spot?
To apply the Optimization Verification test, let T_human be the trajectory achieved by the human pilot, and let T_out be the trajectory achieved by the algorithm. According to our description above, T_human is a superior trajectory to T_out. Thus, the key test is the following:

Does it hold true that R(T_human) > R(T_out)?
Case 1: If this inequality holds, then the reward function R(.) is correctly rating T_human as superior to T_out. But our reinforcement learning algorithm is finding the inferior T_out. This suggests that working on improving our reinforcement learning algorithm is worthwhile.
Case 2: The inequality does not hold: R(T_human) ≤ R(T_out). This means R(.) assigns a worse score to T_human even though it is the superior trajectory. You should work on improving R(.) to better capture the tradeoffs that correspond to a good landing.
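If it helps to see the two cases as a single check, here is a minimal sketch in Python. The reward function R and the two toy trajectories below are hypothetical placeholders, not part of the helicopter system described above; in practice R would be your chosen reward function, and the trajectories would come from logged flights.

```python
# Minimal sketch of the Optimization Verification test for the helicopter
# example. R and the trajectory dictionaries are hypothetical stand-ins.

def R(trajectory):
    # Placeholder reward: penalize bumpiness (sum of squared accelerations)
    # and squared distance from the intended landing spot.
    bumpiness = sum(a ** 2 for a in trajectory["accelerations"])
    landing_error = trajectory["landing_distance"] ** 2
    return -(bumpiness + 10.0 * landing_error)

T_human = {"accelerations": [0.1, 0.2, 0.1], "landing_distance": 0.5}  # human pilot's flight
T_out = {"accelerations": [0.6, 0.9, 0.7], "landing_distance": 2.0}    # algorithm's flight

if R(T_human) > R(T_out):
    print("Case 1: R(.) ranks T_human above T_out; work on the RL algorithm.")
else:
    print("Case 2: R(.) does not rank T_human above T_out; work on R(.).")
```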
Many machine learning applications have this “pattern” of optimizing an approximate scoring function Score_x(.) using an approximate search algorithm. Sometimes, there is no specified input x, so this reduces to just Score(.). In our example above, the scoring function was the reward function Score(T) = R(T), and the optimization algorithm was the reinforcement learning algorithm trying to execute a good trajectory T.
One difference between this and earlier examples is that, rather than comparing to an “optimal” output, you were instead comparing to human-level performance T_human. We assumed T_human is pretty good, even if not optimal. In general, so long as you have some y* (in this example, T_human) that is a superior output to the performance of your current learning algorithm, even if it is not the “optimal” output, then the Optimization Verification test can indicate whether it is more promising to improve the optimization algorithm or the scoring function.
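As a sketch of this general recipe (the function name, arguments, and return strings below are illustrative, not anything prescribed by the text), the test reduces to a single comparison between the score of y* and the score of your algorithm's output:

```python
# Illustrative generalization of the Optimization Verification test.
# score_fn is the approximate scoring function Score(.) being optimized,
# y_out is the output found by the approximate search/optimization algorithm,
# and y_star is any output known to be superior (e.g., human-level output),
# even if it is not optimal.

def optimization_verification(score_fn, y_out, y_star):
    """Suggest which component is more promising to improve."""
    if score_fn(y_star) > score_fn(y_out):
        # Score(.) correctly ranks y_star above y_out, yet the search
        # settled on y_out: the optimization algorithm is falling short.
        return "improve the optimization algorithm"
    # Score(.) rates the inferior output at least as highly as y_star:
    # the scoring function fails to capture what makes y_star better.
    return "improve the scoring function"

# For the helicopter example above, this would be called as
# optimization_verification(R, T_out, T_human).
```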