Given a reward function R(T), the job of the reinforcement learning algorithm is to control the helicopter so that it achieves max_T R(T). However, reinforcement learning algorithms make many approximations and may not succeed in achieving this maximization.

Suppose you have picked some reward R(.) and have run your learning algorithm. However, its performance appears far worse than your human pilot's—the landings are bumpier and seem less safe than what a human pilot achieves. How can you tell if the fault is with the reinforcement learning algorithm—which is trying to carry out a trajectory that achieves max_T R(T)—or if the fault is with the reward function—which is trying to measure, as well as specify, the ideal tradeoff between ride bumpiness and accuracy of landing spot?


To apply the Optimization Verification test, let T_human be the trajectory achieved by the human pilot, and let T_out be the trajectory achieved by the algorithm. According to our description above, T_human is a superior trajectory to T_out. Thus, the key test is the following: Does it hold true that R(T_human) > R(T_out)?

Case 1: If this inequality holds, then the reward function R(.) is correctly rating T_human as superior to T_out. But our reinforcement learning algorithm is finding the inferior T_out. This suggests that working on improving our reinforcement learning algorithm is worthwhile.

Case 2: The inequality does not hold: R(T_human) ≤ R(T_out). This means R(.) assigns a worse score to T_human even though it is the superior trajectory. You should work on improving R(.) to better capture the tradeoffs that correspond to a good landing.
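
Written out as code, the test and its two cases amount to a single comparison. Below is a minimal Python sketch, with R, T_human, and T_out standing in for whatever reward function and logged trajectories your project actually has.

    def optimization_verification_test(R, T_human, T_out):
        """Decide which component to work on next.

        R       -- the reward function the algorithm is trying to maximize
        T_human -- trajectory flown by the human pilot (assumed superior)
        T_out   -- trajectory produced by the reinforcement learning algorithm
        """
        if R(T_human) > R(T_out):
            # Case 1: R(.) correctly rates T_human higher, yet the optimizer
            # still returned the inferior T_out, so the RL algorithm is at fault.
            return "improve the reinforcement learning algorithm"
        else:
            # Case 2: R(T_human) <= R(T_out), so R(.) fails to rank the
            # superior trajectory higher; the reward function needs work.
            return "improve the reward function R(.)"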

Many machine learning applications have this “pattern” of optimizing an approximate scoring function Score_x(.) using an approximate search algorithm. Sometimes, there is no specified input x, so this reduces to just Score(.). In our example above, the scoring function was the reward function Score(T) = R(T), and the optimization algorithm was the reinforcement learning algorithm trying to execute a good trajectory T.

One difference between this and earlier examples is that, rather than comparing to an “optimal” output, you were instead comparing to human-level performance T_human. We assumed T_human is pretty good, even if not optimal. In general, so long as you have some y* (in this example, T_human) that is a superior output to the performance of your current learning algorithm—even if it is not the “optimal” output—then the Optimization Verification test can indicate whether it is more promising to improve the optimization algorithm or the scoring function.
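
In this general form, the check is the same one-line comparison, with Score in place of R and y* in place of T_human. A minimal sketch, using the placeholder names score, y_star, and y_out:

    def optimization_verification(score, y_star, y_out):
        """Generic test: score is the (approximate) scoring function being
        maximized, y_star a known-superior output, and y_out the output
        found by the approximate search algorithm."""
        if score(y_star) > score(y_out):
            return "the search/optimization algorithm is the weaker component"
        return "the scoring function does not rank the better output higher"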







