
                           control aircraft and missiles. Feedback linearization using NNs has been addressed by Chen
                           and Khalil, 59 Yesildirek and Lewis, 70 Ge et al., 22 and others. NNs were used with back-
                           stepping 25 by Lewis et al., 15 Arslan and Basar, 71 Wang and Huang, 72 Ge et al., 20,21 and
                           others.
                              NNs have been used in conjunction with the Isidori–Byrnes regulator equations for
                           output-tracking control by Wang and Huang. 72 A multimodel NN control approach has been
                           given by Narendra and Balakrishnan. 73 Applications of NN control have been extended to
                           partial differential equation systems by Padhi et al. 74 NNs have been used for control of
                           stochastic systems by Poznyak and Ljung. 75 Parisini and co-workers have developed receding
                           horizon controllers based on NNs 76 and hybrid discrete-event NN controllers. 77
                              In practical implementations of NN controllers, there remain problems to overcome.
                           Weight initialization remains an issue, and one may also find that the NN weights
                           become unbounded despite proofs to the contrary. Practical implementation issues were ad-
                           dressed by Chen and Chang, 78  Gutierrez and Lewis, 79  and others. Random initialization of
                           the first-layer NN weights often works in practice, and work by Igelnik and Pao 7 shows that
                           it is theoretically defensible. Computational complexity makes NNs with many hidden layer
                           neurons difficult to implement. Recently, work has intensified in wavelets, NNs that have
                           localized basis functions, and NNs that are self-organizing in the sense of adding or deleting
                           neurons automatically. 36,80,81
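                              A minimal sketch of the random first-layer initialization idea, in the spirit of Igelnik
                           and Pao's random-vector functional-link construction, is given below. It is written in
                           Python/NumPy; the function names, array shapes, and activation choice are illustrative
                           assumptions, not details taken from Ref. 7.

                              import numpy as np

                              def fit_random_hidden_nn(X, Y, n_hidden=50, seed=None):
                                  # Random-vector functional-link style approximator: the first-layer
                                  # weights V and biases b are drawn at random and left fixed; only the
                                  # output weights W are trained, here by linear least squares.
                                  # X: (n_samples, n_inputs) inputs; Y: (n_samples,) or (n_samples, m) targets.
                                  rng = np.random.default_rng(seed)
                                  V = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # fixed first layer
                                  b = rng.uniform(-1.0, 1.0, size=n_hidden)                 # fixed biases
                                  H = np.tanh(X @ V + b)                                    # hidden-layer outputs
                                  W, *_ = np.linalg.lstsq(H, Y, rcond=None)                 # trained output weights
                                  return V, b, W

                              def predict(V, b, W, X):
                                  return np.tanh(X @ V + b) @ W

                           Because only the output weights W enter linearly, tuning and boundedness analyses of the
                           kind cited above can treat such an approximator as linear in its adjustable parameters, even
                           though the overall input-output map is nonlinear.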
                              By now it is understood that NNs offer an elegant extension of adaptive control and
                           other techniques to systems that are nonlinear in the unknown parameters. The universal
                           approximation properties of NNs 2,3  avoid the use of specialized basis sets, including regres-
                           sion matrices. Formalized, improved proofs avoid assumptions such as certainty
                           equivalence. Robustifying terms avoid the need for persistency of excitation. Recent books
                           on NN feedback control include Refs. 15, 20, 22, 28, 31, and 82.
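                              As an illustration of the last two points, a representative output-layer weight-tuning law
                           of the kind used in this literature can be written (the specific form and notation below are
                           given for illustration and are not quoted from any one of the cited references) as

                               \dot{\hat{W}} = F \, \phi(x) \, r^{\top} - \kappa F \, \lVert r \rVert \, \hat{W},

                           where \hat{W} is the estimated weight matrix, \phi(x) the vector of basis (activation)
                           functions, r a filtered tracking error, F = F^{\top} > 0 an adaptation gain, and \kappa > 0 a
                           small design constant. The first term is the usual gradient-style adaptation; the second is a
                           robustifying (e-modification) term that keeps the weight estimates bounded without requiring
                           persistency of excitation.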


            11.2  Approximate Dynamic Programming
                           Adaptive critics are reinforcement learning designs that attempt to approximate dynamic
                           programming. 83,84  They approach the optimal solution through forward approximate dynamic
                           programming. Initially, they were proposed by Werbos. 44  Overviews of the initial work in
                           NN control are provided by Miller et al. 50 and the Handbook of Intelligent Control. 52 Howard 46
                           showed the convergence of an algorithm relying on the successive policy iteration
                           solution of a nonlinear Lyapunov equation for the cost (value) and an optimizing equation
                           for the control (action). This algorithm relied on perfect knowledge of the system dynamics
                           and is an off-line technique. Later, various online dynamic-programming-based reinforcement
                           learning algorithms emerged, based mainly on Werbos's HDP, 33 Sutton's temporal
                           difference (TD) learning methods, 85 and Q-learning, which was introduced by Watkins 47
                           and Werbos 8 (called action-dependent critic schemes there). Critic and action network tuning
                           was provided by recursive least squares (RLS), gradient techniques, or the backpropagation
                           algorithm. 17 Early work
                           on dynamic-programming-based reinforcement learning focused on discrete finite-state and
                           action spaces. These depended on lookup tables or linear function approximators. Conver-
                           gence results were shown for this case, for example by Dayan. 86
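                              A minimal sketch of the off-line, model-based procedure described above (successive policy
                           evaluation and greedy policy improvement over a finite state and action space) follows. It is
                           written in Python; the array layout of the assumed transition model P and reward R is an
                           illustrative choice, not a detail taken from Howard's formulation.

                              import numpy as np

                              def policy_iteration(P, R, gamma=0.9):
                                  # Howard-style policy iteration for a finite Markov decision problem.
                                  # P[a, s, t]: probability of moving from state s to state t under action a
                                  # R[s, a]   : one-step reward for taking action a in state s
                                  n_actions, n_states, _ = P.shape
                                  policy = np.zeros(n_states, dtype=int)
                                  while True:
                                      # Policy evaluation: solve the linear value equations
                                      # V = R_pi + gamma * P_pi V for the current policy.
                                      P_pi = P[policy, np.arange(n_states), :]
                                      R_pi = R[np.arange(n_states), policy]
                                      V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
                                      # Policy improvement: act greedily with respect to V.
                                      Q = R + gamma * np.einsum('ast,t->sa', P, V)
                                      new_policy = Q.argmax(axis=1)
                                      if np.array_equal(new_policy, policy):
                                          return policy, V
                                      policy = new_policy

                           The evaluation step requires the model P, which is what makes the procedure off-line. The
                           online schemes mentioned above (HDP, TD learning, Q-learning) replace this exact solve with
                           sampled, incremental updates of a critic, so that perfect knowledge of the dynamics is no
                           longer needed.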
                              For continuous-state and action spaces, convergence results are more challenging as
                           adaptive critics require the use of nonlinear function approximators. Four schemes for ap-
                           proximate dynamic programming were given in Ref. 33: the HDP and DHP algorithms and
                           their action-dependent versions (ADHDP and ADDHP). The linear quadratic regulation
                           (LQR) problem 37 served as a testbed for many of these studies. Solid convergence results
                           were obtained for various adaptive critic designs for the LQR problem. We mention the work
                           of Bradtke et al., 48 where Q-learning was shown to converge when using nonlinear function