
                                    Q_h(x_k, h(x_k)) = V_h(x_k)

                            where subscript h denotes a prescribed control or policy sequence u_k = h(x_k). A recursion
                           for Q is given by
                                    Q_h(x_k, u_k) = r(x_k, u_k) + γ Q_h(x_{k+1}, h(x_{k+1}))

                           In terms of Q, Bellman’s principle is particularly easy to write; in fact, defining the optimal
                           Q value as
                                    Q*(x_k, u_k) = r(x_k, u_k) + γ V*(x_{k+1})

                           one has the optimal value as
                                    V*(x_k) = min_{u_k} Q*(x_k, u_k)

                           The optimal control policy is given by
                                    h*(x_k) = arg min_{u_k} Q*(x_k, u_k)

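                            To make these relations concrete, the following minimal sketch solves the optimal Q
                            relations on a small finite problem. The two-state transition table f, the cost table r,
                            and the discount factor used here are illustrative assumptions for this sketch, not values
                            taken from the chapter.

                            import numpy as np

                            # Illustrative finite problem (assumed, not from the chapter): two states, two
                            # actions; f[x, u] is the next state and r[x, u] the one-step cost.
                            f = np.array([[0, 1],
                                          [0, 1]])
                            r = np.array([[1.0, 4.0],
                                          [0.0, 2.0]])
                            gamma = 0.9  # discount factor

                            # Iterate Q*(x_k, u_k) = r(x_k, u_k) + gamma * V*(x_{k+1}) with
                            # V*(x_k) = min_{u_k} Q*(x_k, u_k) until the Q table settles.
                            Q = np.zeros_like(r)
                            for _ in range(200):
                                V = Q.min(axis=1)        # V*(x_k) = min over u_k of Q*(x_k, u_k)
                                Q = r + gamma * V[f]     # Bellman backup on the Q function

                            print("V* =", Q.min(axis=1))     # optimal value
                            print("h* =", Q.argmin(axis=1))  # h*(x_k) = arg min_{u_k} Q*(x_k, u_k)
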
                            Watkins showed that the following successive iteration scheme, known as Q learning,
                            converges to the optimal solution:
                               1. Find the Q value for the prescribed policy h_j(x_k):

                                    Q_j(x_k, u_k) = r(x_k, u_k) + γ Q_j(x_{k+1}, h_j(x_{k+1}))

                              2. Policy improvement:
                                    h_{j+1}(x_k) = arg min_{u_k} Q_j(x_k, u_k)

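                            A tabular sketch of this two-step iteration, reusing the illustrative problem assumed in
                            the previous sketch; policy evaluation is done here by repeated substitution of the
                            recursion rather than by an exact solve.

                            import numpy as np

                            f = np.array([[0, 1], [0, 1]])           # assumed transitions x_{k+1} = f(x_k, u_k)
                            r = np.array([[1.0, 4.0], [0.0, 2.0]])   # one-step cost r(x_k, u_k)
                            gamma = 0.9
                            n_states, n_actions = r.shape

                            h = np.ones(n_states, dtype=int)         # arbitrary initial policy h_0
                            for j in range(20):
                                # Step 1: evaluate the prescribed policy h_j by iterating
                                # Q_j(x_k, u_k) = r(x_k, u_k) + gamma * Q_j(x_{k+1}, h_j(x_{k+1}))
                                Q = np.zeros((n_states, n_actions))
                                for _ in range(200):
                                    Q = r + gamma * Q[f, h[f]]
                                # Step 2: policy improvement
                                h_new = Q.argmin(axis=1)             # h_{j+1}(x_k) = arg min_{u_k} Q_j(x_k, u_k)
                                if np.array_equal(h_new, h):
                                    break                            # policy no longer changes
                                h = h_new

                            print("converged policy h* =", h)
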
                            Using NNs to approximate the Q function and the policy, one can write the ADHDP
                            algorithm in a very straightforward manner. Since the control action u_k is now explicitly
                            an input to the critic NN, this is known as action-dependent HDP. Q learning converges
                            faster than HDP and can be used in the case of unknown system dynamics. 48
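                            The sketch below illustrates only the critic half of this idea under simple assumptions:
                            a scalar linear plant, an assumed fixed policy h, and a linear-in-the-parameters critic
                            standing in for the critic NN. The point is that the action u_k enters the critic as an
                            explicit input and the update uses only sampled transition data, not the plant model.

                            import numpy as np

                            # Assumed plant, cost, and policy for illustration only.
                            a, b, gamma = 0.9, 0.5, 0.95             # scalar plant x_{k+1} = a*x_k + b*u_k
                            r = lambda x, u: x**2 + u**2             # one-step cost r(x_k, u_k)
                            h = lambda x: -0.4 * x                   # prescribed policy being evaluated

                            def phi(x, u):
                                # Quadratic basis in (x, u): the action u_k is an explicit critic input.
                                return np.array([x * x, x * u, u * u])

                            w = np.zeros(3)                          # critic weights, Q(x, u) ~ w . phi(x, u)
                            alpha = 0.02                             # learning rate
                            rng = np.random.default_rng(0)

                            for _ in range(20000):
                                x = rng.uniform(-2, 2)               # sampled state
                                u = rng.uniform(-2, 2)               # exploratory action
                                x_next = a * x + b * u               # observed transition; the critic update
                                                                     # below never uses a or b directly
                                # Target from the Q recursion: r(x_k, u_k) + gamma * Q(x_{k+1}, h(x_{k+1}))
                                target = r(x, u) + gamma * w @ phi(x_next, h(x_next))
                                w += alpha * (target - w @ phi(x, u)) * phi(x, u)

                            # For these assumed numbers the exact Q is quadratic with coefficients close to
                            # [2.67, 1.86, 1.52]; the learned weights should approach these values.
                            print("critic weights:", w)

                            In the full ADHDP scheme a second (actor) NN would then be tuned to minimize the learned
                            Q function, playing the role of the policy-improvement step.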
                              An action-dependent version of DHP is also available wherein the gradients of the Q
                           function are approximated using NNs. Note that two NNs are needed, since there are two
                            gradients, as Q is a function of both x_k and u_k.
            11   HISTORICAL DEVELOPMENT, REFERENCED WORK, AND FURTHER STUDY
                           A firm foundation for the use of NNs in feedback control systems has been developed over
                           the years by many researchers. Included here is a historical development and references to
                           the body of work in neurocontrol.
            11.1  NN for Feedback Control
                            The use of NNs in feedback control systems was first proposed by Werbos. 8  Since then, NN
                            control has been studied by many researchers. Recently, NNs have entered the mainstream
                           of control theory as a natural extension of adaptive control to systems that are nonlinear in
                            the tunable parameters. The state of NN control is well illustrated by papers in the
                            Automatica Special issue on NN control. 49  Overviews of the initial work in NN control are
                           provided by Miller et al. 50  and the Handbook of Intelligent Control, 51  which highlighted a
                           host of difficulties to be addressed for closed-loop control applications. Neural network