
1. Find the value for the prescribed policy u_j(x_k):

       V_j(x_k) = r(x_k, u_j(x_k)) + V_j(x_{k+1})

2. Policy improvement:

       u_{j+1}(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + V_j(x_{k+1}) \right)
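As a concrete illustration of these two steps, the following Python sketch runs policy iteration on a small hypothetical finite-state, finite-action problem with a known one-step model. The state and action counts, the transition table, the stage costs, and the discount factor gamma are assumptions made only so the sketch runs; they are not part of the algorithm statement above.

```python
import numpy as np

# Hypothetical finite problem: n_x states, n_u actions, known one-step model.
n_x, n_u = 5, 3
rng = np.random.default_rng(0)
next_state = rng.integers(0, n_x, size=(n_x, n_u))  # x_{k+1} = f(x_k, u_k)
cost = rng.uniform(0.0, 1.0, size=(n_x, n_u))       # stage cost r(x_k, u_k)
gamma = 0.9     # discount, assumed here so the infinite-horizon value stays finite

policy = np.zeros(n_x, dtype=int)                   # initial policy u_0(x)
for j in range(100):
    # Step 1: policy evaluation, V_j(x) = r(x, u_j(x)) + gamma * V_j(x')
    V = np.zeros(n_x)
    for _ in range(500):                            # simple fixed-point sweeps
        V = cost[np.arange(n_x), policy] + gamma * V[next_state[np.arange(n_x), policy]]
    # Step 2: policy improvement, u_{j+1}(x) = argmin_u [ r(x, u) + gamma * V_j(x') ]
    new_policy = (cost + gamma * V[next_state]).argmin(axis=1)
    if np.array_equal(new_policy, policy):
        break                                       # policy no longer changes
    policy = new_policy

print("converged policy:", policy)
```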
Werbos (Ref. 33) and others (see Section 11) showed how to implement ADP controllers by four basic techniques—HDP, DHP, and their action-dependent forms ADHDP and ADDHP—to be described next.
Heuristic Dynamic Programming (HDP). In HDP, one approximates the value by a critic NN with tunable parameters w_j and the control by an action-generating NN with tunable parameters v_j, so that

       Critic NN:   V_j(x_k) = V(x_k, w_j)
       Action NN:   u_j(x_k) = u(x_k, v_j)
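As one concrete (hypothetical) realization of these two approximators, the sketch below implements the critic and action networks as small one-hidden-layer networks in numpy. The layer sizes, tanh activation, initialization, and function names are illustrative choices; the text does not prescribe a particular network structure.

```python
import numpy as np

def init_nn(n_in, n_hidden, n_out, seed):
    """Random parameters for a one-hidden-layer network (sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    return {"W1": 0.1 * rng.standard_normal((n_hidden, n_in)), "b1": np.zeros(n_hidden),
            "W2": 0.1 * rng.standard_normal((n_out, n_hidden)), "b2": np.zeros(n_out)}

def nn_forward(p, x):
    """tanh hidden layer, linear output layer."""
    h = np.tanh(p["W1"] @ x + p["b1"])
    return p["W2"] @ h + p["b2"]

n_state, n_control = 2, 1
w_j = init_nn(n_state, 8, 1, seed=1)          # critic parameters w_j
v_j = init_nn(n_state, 8, n_control, seed=2)  # action parameters v_j

x_k = np.array([0.5, -0.2])
V_hat = nn_forward(w_j, x_k)[0]   # critic NN:  V_j(x_k) approximated by V(x_k, w_j)
u_hat = nn_forward(v_j, x_k)      # action NN:  u_j(x_k) approximated by u(x_k, v_j)
```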
HDP then proceeds as follows:

Critic Update

   Find the desired target value using

       V^D_{k,j+1} = r(x_k, u_j(x_k)) + V(x_{k+1}, w_j)

   Update critic weights using recursive least squares (RLS), backprop, and so on:

       w_{j+1} = w_j + \alpha \frac{\partial V(x_k, w_j)}{\partial w_j} \left[ V^D_{k,j+1} - V(x_k, w_j) \right]
Action Update

   Find the desired target action using

       u^D_{k,j+1} = \arg\min_{u_k} \left( r(x_k, u_k) + V(x_{k+1}, w_{j+1}) \right)

   Update action weights using RLS, backprop, and so on:

       v_{j+1} = v_j + \alpha \frac{\partial u(x_k, v_j)}{\partial v_j} \left[ u^D_{k,j+1} - u(x_k, v_j) \right]
This procedure is straightforward to implement given today's software (e.g., MATLAB). The value required for the next state x_{k+1} may be found either by using the dynamics equation (1) or by observing the next state from the actual system.
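To make the critic and action updates concrete, the following is a minimal sketch of one way the HDP iteration might be coded for a scalar linear plant with quadratic stage cost. The plant parameters, the quadratic critic V(x, w) = w*x^2, the linear action u(x, v) = v*x, and the learning rates are illustrative assumptions; the text itself leaves the approximator structure and the tuning method (RLS, backprop, and so on) open.

```python
import numpy as np

# Assumed scalar plant x_{k+1} = a*x + b*u with stage cost r(x, u) = q*x^2 + R*u^2.
a, b, q, R = 0.9, 1.0, 1.0, 1.0
alpha, beta = 0.05, 0.05         # assumed learning rates for critic and action updates
w, v = 0.0, 0.0                  # critic weight (V(x, w) = w*x^2) and action weight (u(x, v) = v*x)
xs = np.linspace(-1.0, 1.0, 21)  # training states sampled over a region of interest

for j in range(200):             # outer HDP iterations
    # Critic update: move V(x, w) toward V^D = r(x, u_j(x)) + V(x_{k+1}, w_j)
    w_next = w
    for x in xs:
        u = v * x
        x_next = a * x + b * u                      # next state from the dynamics
        VD = q * x**2 + R * u**2 + w * x_next**2    # desired target value (uses old w_j)
        dV_dw = x**2                                # dV/dw for V(x, w) = w*x^2
        w_next += alpha * dV_dw * (VD - w_next * x**2)
    w = w_next                                      # this is w_{j+1}
    # Action update: move u(x, v) toward u^D = argmin_u [ r(x, u) + V(x_{k+1}, w_{j+1}) ]
    for x in xs:
        uD = -(w * a * b * x) / (R + w * b**2)      # closed-form minimizer for this sketch
        du_dv = x                                   # du/dv for u(x, v) = v*x
        v += beta * du_dv * (uD - v * x)

print("critic weight w:", w, " action gain v:", v)
```

On this illustrative problem the action weight v tends toward the optimal state-feedback gain and w toward the quadratic value coefficient, but convergence of HDP in general depends on the approximators chosen, the coverage of the state space, and the tuning method used.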
Dual Heuristic Programming (DHP). Noting that the control depends only on the gradient of the value function [e.g., see (9)], it is advantageous to approximate not the value itself but its gradient using a NN. This yields a more complex algorithm, but DHP converges faster than HDP. Details are in Ref. 33.
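To illustrate the idea of a gradient critic (not the full algorithm of Ref. 33), the sketch below evaluates the value gradient of a fixed linear state-feedback policy on assumed linear dynamics, using a linear parameterization lambda(x, W) = W x and a chain-rule target. All matrices, gains, and step sizes are illustrative assumptions.

```python
import numpy as np

# Assumed linear plant x_{k+1} = A x + B u, fixed policy u = -K x,
# and quadratic stage cost r(x, u) = x'Qx + u'Ru.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.array([[0.2, 0.3]])          # an assumed stabilizing gain
Acl = A - B @ K                     # closed-loop dynamics under u = -K x
W = np.zeros((2, 2))                # gradient critic: lambda(x, W) = W @ x ~ dV/dx
alpha = 0.05

rng = np.random.default_rng(0)
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0, size=2)
    x_next = Acl @ x
    # Chain-rule target for the gradient of r(x, -Kx) + V(x_next) with respect to x
    lam_target = 2.0 * (Q + K.T @ R @ K) @ x + Acl.T @ (W @ x_next)
    # Simple gradient step moving lambda(x, W) toward the target
    W += alpha * np.outer(lam_target - W @ x, x)
```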

Q-Learning or Action-Dependent HDP. A function that is more advantageous than the value function for ADP is the Q function, defined by Watkins (Ref. 47) and Werbos (Ref. 8) as

       Q(x_k, u_k) = r(x_k, u_k) + V(x_{k+1})

Note that Q is a function of both x_k and the control action u_k and that