
1. Find the value for the prescribed policy u_j(x_k):

       V_j(x_k) = r(x_k, u_j(x_k)) + V_j(x_{k+1})

2. Policy improvement:

       u_{j+1}(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + V_j(x_{k+1}) \right)
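As a concrete illustration of these two steps, the following Python sketch runs policy iteration on a small hypothetical finite-state, finite-action problem with a known one-step model. The state and action counts, the transition table, the stage costs, and the discount factor gamma are assumptions made only so the sketch runs; they are not part of the algorithm statement above.

```python
import numpy as np

# Hypothetical finite problem: n_x states, n_u actions, known one-step model.
n_x, n_u = 5, 3
rng = np.random.default_rng(0)
next_state = rng.integers(0, n_x, size=(n_x, n_u))  # x_{k+1} = f(x_k, u_k)
cost = rng.uniform(0.0, 1.0, size=(n_x, n_u))       # stage cost r(x_k, u_k)
gamma = 0.9     # discount, assumed here so the infinite-horizon value stays finite

policy = np.zeros(n_x, dtype=int)                   # initial policy u_0(x)
for j in range(100):
    # Step 1: policy evaluation, V_j(x) = r(x, u_j(x)) + gamma * V_j(x')
    V = np.zeros(n_x)
    for _ in range(500):                            # simple fixed-point sweeps
        V = cost[np.arange(n_x), policy] + gamma * V[next_state[np.arange(n_x), policy]]
    # Step 2: policy improvement, u_{j+1}(x) = argmin_u [ r(x, u) + gamma * V_j(x') ]
    new_policy = (cost + gamma * V[next_state]).argmin(axis=1)
    if np.array_equal(new_policy, policy):
        break                                       # policy no longer changes
    policy = new_policy

print("converged policy:", policy)
```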
Werbos (Ref. 33) and others (see Section 11) showed how to implement ADP controllers by four basic techniques—HDP, DHP, and their action-dependent forms ADHDP and ADDHP—to be described next.
Heuristic Dynamic Programming (HDP). In HDP, one approximates the value by a critic NN with tunable parameters w_j and the control by an action-generating NN with tunable parameters v_j, so that

       Critic NN:   V_j(x_k) = V(x_k, w_j)
       Action NN:   u_j(x_k) = u(x_k, v_j)
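As one concrete (hypothetical) realization of these two approximators, the sketch below implements the critic and action networks as small one-hidden-layer networks in numpy. The layer sizes, tanh activation, initialization, and function names are illustrative choices; the text does not prescribe a particular network structure.

```python
import numpy as np

def init_nn(n_in, n_hidden, n_out, seed):
    """Random parameters for a one-hidden-layer network (sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    return {"W1": 0.1 * rng.standard_normal((n_hidden, n_in)), "b1": np.zeros(n_hidden),
            "W2": 0.1 * rng.standard_normal((n_out, n_hidden)), "b2": np.zeros(n_out)}

def nn_forward(p, x):
    """tanh hidden layer, linear output layer."""
    h = np.tanh(p["W1"] @ x + p["b1"])
    return p["W2"] @ h + p["b2"]

n_state, n_control = 2, 1
w_j = init_nn(n_state, 8, 1, seed=1)          # critic parameters w_j
v_j = init_nn(n_state, 8, n_control, seed=2)  # action parameters v_j

x_k = np.array([0.5, -0.2])
V_hat = nn_forward(w_j, x_k)[0]   # critic NN:  V_j(x_k) approximated by V(x_k, w_j)
u_hat = nn_forward(v_j, x_k)      # action NN:  u_j(x_k) approximated by u(x_k, v_j)
```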
HDP then proceeds as follows:

Critic Update

   Find the desired target value using

       V^D_{k,j+1} = r(x_k, u_j(x_k)) + V(x_{k+1}, w_j)

   Update critic weights using recursive least squares (RLS), backprop, and so on:

       w_{j+1} = w_j + \alpha \frac{\partial V(x_k, w_j)}{\partial w_j} \left[ V^D_{k,j+1} - V(x_k, w_j) \right]
Action Update

   Find the desired target action using

       u^D_{k,j+1} = \arg\min_{u_k} \left( r(x_k, u_k) + V(x_{k+1}, w_{j+1}) \right)

   Update action weights using RLS, backprop, and so on:

       v_{j+1} = v_j + \alpha \frac{\partial u(x_k, v_j)}{\partial v_j} \left[ u^D_{k,j+1} - u(x_k, v_j) \right]
This procedure is straightforward to implement given today's software (e.g., MATLAB). The value required for the next state x_{k+1} may be found either by using the dynamics equation (1) or by observing the next state from the actual system.
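To make the critic and action updates concrete, the following is a minimal sketch of one way the HDP iteration might be coded for a scalar linear plant with quadratic stage cost. The plant parameters, the quadratic critic V(x, w) = w*x^2, the linear action u(x, v) = v*x, and the learning rates are illustrative assumptions; the text itself leaves the approximator structure and the tuning method (RLS, backprop, and so on) open.

```python
import numpy as np

# Assumed scalar plant x_{k+1} = a*x + b*u with stage cost r(x, u) = q*x^2 + R*u^2.
a, b, q, R = 0.9, 1.0, 1.0, 1.0
alpha, beta = 0.05, 0.05         # assumed learning rates for critic and action updates
w, v = 0.0, 0.0                  # critic weight (V(x, w) = w*x^2) and action weight (u(x, v) = v*x)
xs = np.linspace(-1.0, 1.0, 21)  # training states sampled over a region of interest

for j in range(200):             # outer HDP iterations
    # Critic update: move V(x, w) toward V^D = r(x, u_j(x)) + V(x_{k+1}, w_j)
    w_next = w
    for x in xs:
        u = v * x
        x_next = a * x + b * u                      # next state from the dynamics
        VD = q * x**2 + R * u**2 + w * x_next**2    # desired target value (uses old w_j)
        dV_dw = x**2                                # dV/dw for V(x, w) = w*x^2
        w_next += alpha * dV_dw * (VD - w_next * x**2)
    w = w_next                                      # this is w_{j+1}
    # Action update: move u(x, v) toward u^D = argmin_u [ r(x, u) + V(x_{k+1}, w_{j+1}) ]
    for x in xs:
        uD = -(w * a * b * x) / (R + w * b**2)      # closed-form minimizer for this sketch
        du_dv = x                                   # du/dv for u(x, v) = v*x
        v += beta * du_dv * (uD - v * x)

print("critic weight w:", w, " action gain v:", v)
```

On this illustrative problem the action weight v tends toward the optimal state-feedback gain and w toward the quadratic value coefficient, but convergence of HDP in general depends on the approximators chosen, the coverage of the state space, and the tuning method used.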
Dual Heuristic Programming (DHP). Noting that the control depends only on the gradient of the value function [e.g., see (9)], it is advantageous to approximate not the value itself but its gradient using a NN. This yields a more complex algorithm, but DHP converges faster than HDP. Details are in Ref. 33.
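To illustrate the idea of a gradient critic (not the full algorithm of Ref. 33), the sketch below evaluates the value gradient of a fixed linear state-feedback policy on assumed linear dynamics, using a linear parameterization lambda(x, W) = W x and a chain-rule target. All matrices, gains, and step sizes are illustrative assumptions.

```python
import numpy as np

# Assumed linear plant x_{k+1} = A x + B u, fixed policy u = -K x,
# and quadratic stage cost r(x, u) = x'Qx + u'Ru.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.array([[0.2, 0.3]])          # an assumed stabilizing gain
Acl = A - B @ K                     # closed-loop dynamics under u = -K x
W = np.zeros((2, 2))                # gradient critic: lambda(x, W) = W @ x ~ dV/dx
alpha = 0.05

rng = np.random.default_rng(0)
for _ in range(2000):
    x = rng.uniform(-1.0, 1.0, size=2)
    x_next = Acl @ x
    # Chain-rule target for the gradient of r(x, -Kx) + V(x_next) with respect to x
    lam_target = 2.0 * (Q + K.T @ R @ K) @ x + Acl.T @ (W @ x_next)
    # Simple gradient step moving lambda(x, W) toward the target
    W += alpha * np.outer(lam_target - W @ x, x)
```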

Q-Learning or Action-Dependent HDP. A function that is more advantageous than the value function for ADP is the Q function, defined by Watkins (Ref. 47) and Werbos (Ref. 8) as

       Q(x_k, u_k) = r(x_k, u_k) + V(x_{k+1})

Note that Q is a function of both x_k and the control action u_k and that