
                           control aircraft and missiles. Feedback linearization using NNs has been addressed by Chen
                           and Khalil, 59 Yesildirek and Lewis, 70 Ge et al., 22 and others. NNs were used with back-
                           stepping 25 by Lewis et al., 15 Arslan and Basar, 71 Wang and Huang, 72 Ge et al., 20,21 and
                           others.
                              NNs have been used in conjunction with the Isidori–Byrnes regulator equations for
                           output-tracking control by Wang and Huang. 72 A multimodel NN control approach has been
                           given by Narendra and Balakrishnan. 73 Applications of NN control have been extended to
                           partial differential equation systems by Padhi et al. 74 NNs have been used for control of
                           stochastic systems by Poznyak and Ljung. 75 Parisini and co-workers have developed receding
                           horizon controllers based on NNs 76 and hybrid discrete-event NN controllers. 77
                              In practical implementations of NN controllers, there remain problems to overcome.
                           Weight initialization remains an issue, and one may also find that the NN weights
                           become unbounded despite proofs to the contrary. Practical implementation issues were ad-
                           dressed by Chen and Chang, 78  Gutierrez and Lewis, 79  and others. Random initialization of
                           the first-layer NN weights often works in practice, and work by Igelnik and Pao 7 shows that
                           it is theoretically defensible. Computational complexity makes NNs with many hidden layer
                           neurons difficult to implement. Recently, work has intensified in wavelets, NNs that have
                           localized basis functions, and NNs that are self-organizing in the sense of adding or deleting
                           neurons automatically. 36,80,81
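                              A minimal sketch of the random first-layer initialization idea, in the spirit of Igelnik
                           and Pao's random-vector functional-link construction, is given below. It is written in
                           Python/NumPy; the function names, array shapes, and activation choice are illustrative
                           assumptions, not details taken from Ref. 7.

                              import numpy as np

                              def fit_random_hidden_nn(X, Y, n_hidden=50, seed=None):
                                  # Random-vector functional-link style approximator: the first-layer
                                  # weights V and biases b are drawn at random and left fixed; only the
                                  # output weights W are trained, here by linear least squares.
                                  # X: (n_samples, n_inputs) inputs; Y: (n_samples,) or (n_samples, m) targets.
                                  rng = np.random.default_rng(seed)
                                  V = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # fixed first layer
                                  b = rng.uniform(-1.0, 1.0, size=n_hidden)                 # fixed biases
                                  H = np.tanh(X @ V + b)                                    # hidden-layer outputs
                                  W, *_ = np.linalg.lstsq(H, Y, rcond=None)                 # trained output weights
                                  return V, b, W

                              def predict(V, b, W, X):
                                  return np.tanh(X @ V + b) @ W

                           Because only the output weights W enter linearly, tuning and boundedness analyses of the
                           kind cited above can treat such an approximator as linear in its adjustable parameters, even
                           though the overall input-output map is nonlinear.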
                              By now it is understood that NNs offer an elegant extension of adaptive control and
                           other techniques to systems that are nonlinear in the unknown parameters. The universal
                           approximation properties of NNs 2,3  avoid the use of specialized basis sets, including regres-
                           sion matrices. Formalized, improved proofs avoid assumptions such as certainty
                           equivalence. Robustifying terms avoid the need for persistency of excitation. Recent books
                           on NN feedback control include Refs. 15, 20, 22, 28, 31, and 82.
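                              As an illustration of the last two points, a representative output-layer weight-tuning law
                           of the kind used in this literature can be written (the specific form and notation below are
                           given for illustration and are not quoted from any one of the cited references) as

                               \dot{\hat{W}} = F \, \phi(x) \, r^{\top} - \kappa F \, \lVert r \rVert \, \hat{W},

                           where \hat{W} is the estimated weight matrix, \phi(x) the vector of basis (activation)
                           functions, r a filtered tracking error, F = F^{\top} > 0 an adaptation gain, and \kappa > 0 a
                           small design constant. The first term is the usual gradient-style adaptation; the second is a
                           robustifying (e-modification) term that keeps the weight estimates bounded without requiring
                           persistency of excitation.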


            11.2  Approximate Dynamic Programming
                           Adaptive critics are reinforcement learning designs that attempt to approximate dynamic
                           programming. 83,84  They approach the optimal solution through forward approximate dynamic
                           programming. Initially, they were proposed by Werbos. 44  Overviews of the initial work in
                           NN control are provided by Miller et al. 50 and the Handbook of Intelligent Control. 52 Howard 46
                           showed the convergence of an algorithm relying on the successive policy iteration
                           solution of a nonlinear Lyapunov equation for the cost (value) and an optimizing equation
                           for the control (action). This algorithm relied on perfect knowledge of the system dynamics
                           and is an off-line technique. Later, various online dynamic-programming-based reinforcement
                           learning algorithms emerged, based mainly on Werbos's HDP, 33 Sutton's temporal
                           difference (TD) learning methods, 85 and Q-learning, which was introduced by Watkins 47
                           and Werbos 8 (called action-dependent critic schemes there). Critic and action network tuning
                           was provided by recursive least squares (RLS), gradient techniques, or the backpropagation
                           algorithm. 17 Early work
                           on dynamic-programming-based reinforcement learning focused on discrete finite-state and
                           action spaces. These depended on lookup tables or linear function approximators. Conver-
                           gence results were shown for this case, for example by Dayan. 86
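                              A minimal sketch of the off-line, model-based procedure described above (successive policy
                           evaluation and greedy policy improvement over a finite state and action space) follows. It is
                           written in Python; the array layout of the assumed transition model P and reward R is an
                           illustrative choice, not a detail taken from Howard's formulation.

                              import numpy as np

                              def policy_iteration(P, R, gamma=0.9):
                                  # Howard-style policy iteration for a finite Markov decision problem.
                                  # P[a, s, t]: probability of moving from state s to state t under action a
                                  # R[s, a]   : one-step reward for taking action a in state s
                                  n_actions, n_states, _ = P.shape
                                  policy = np.zeros(n_states, dtype=int)
                                  while True:
                                      # Policy evaluation: solve the linear value equations
                                      # V = R_pi + gamma * P_pi V for the current policy.
                                      P_pi = P[policy, np.arange(n_states), :]
                                      R_pi = R[np.arange(n_states), policy]
                                      V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
                                      # Policy improvement: act greedily with respect to V.
                                      Q = R + gamma * np.einsum('ast,t->sa', P, V)
                                      new_policy = Q.argmax(axis=1)
                                      if np.array_equal(new_policy, policy):
                                          return policy, V
                                      policy = new_policy

                           The evaluation step requires the model P, which is what makes the procedure off-line. The
                           online schemes mentioned above (HDP, TD learning, Q-learning) replace this exact solve with
                           sampled, incremental updates of a critic, so that perfect knowledge of the dynamics is no
                           longer needed.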
                              For continuous-state and action spaces, convergence results are more challenging as
                           adaptive critics require the use of nonlinear function approximators. Four schemes for ap-
                           proximate dynamic programming were given in Ref. 33: the HDP and DHP algorithms and
                           their action-dependent versions (ADHDP and ADDHP). The linear quadratic regulation
                           (LQR) problem 37 served as a testbed for many of these studies. Solid convergence results
                           were obtained for various adaptive critic designs for the LQR problem. We mention the work
                           of Bradtke et al., 48 where Q-learning was shown to converge when using nonlinear function