commercial systems, usually only certain restricted measurements of the plant are available. In this output feedback case one may use an additional dynamic NN, with its own internal dynamics, in the controller. The function of this additional NN is effectively to provide estimates of the unmeasurable plant states, so that the dynamic NN functions as what is known in control system theory as an observer.
The issues of observer design using NNs can be appreciated with rigid robotic systems (Ref. 12). For these systems, the dynamics can be written in state-variable form as
$$\dot{x}_1 = x_2$$
$$\dot{x}_2 = M^{-1}(x_1)\left[\tau - N(x_1, x_2)\right]$$

where $x_1 = q$, $x_2 = \dot{q}$, and the nonlinear function $N(x_1, x_2) = V_m(x_1, x_2)\,x_2 + G(x_1) + F(x_2)$ is assumed to be unknown.
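As a concrete illustration of this state-variable form, the following minimal sketch evaluates the state derivatives for a single-link arm; the link parameters, the friction model, and the function names are illustrative assumptions, not part of the handbook's development.

```python
import numpy as np

# Minimal sketch of the state-variable robot dynamics above:
#   x1_dot = x2,   x2_dot = M^{-1}(x1) [ tau - N(x1, x2) ]
# A single-link arm is used purely for illustration; the parameters
# m, l, g, b are hypothetical, not from the handbook.
m, l, g, b = 1.0, 0.5, 9.81, 0.1

def M(x1):
    """Known inertia matrix (a 1x1 'matrix' for one link)."""
    return np.array([[m * l**2]])

def N(x1, x2):
    """Lumped unknown nonlinearity: gravity G(x1) plus friction F(x2).
    (The Coriolis term V_m vanishes for a single link.)"""
    return m * g * l * np.sin(x1) + b * x2

def state_derivative(x1, x2, tau):
    """One evaluation of the state equations."""
    x1_dot = x2
    x2_dot = np.linalg.solve(M(x1), tau - N(x1, x2))
    return x1_dot, x2_dot
```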
It can be shown (Ref. 31) that the following dynamic NN observer can provide estimates of the entire state $x = [x_1^T \; x_2^T]^T = [q^T \; \dot{q}^T]^T$ given measurements of only $x_1(t) = q(t)$:
$$\dot{\hat{x}}_1 = \hat{x}_2 + k_D \tilde{x}_1$$
$$\dot{\hat{z}}_2 = M^{-1}(x_1)\left[\tau - \hat{W}_o^T \sigma_o(\hat{x}) + k_P \tilde{x}_1 + v_o\right]$$
$$\hat{x}_2 = \hat{z}_2 + k_{P2} \tilde{x}_1$$
In this system, the hat denotes estimates and the tilde denotes estimation errors. It is assumed that the inertia matrix M(q) is known, but all other nonlinearities are estimated by the observer NN $\hat{W}_o^T \sigma_o(\hat{x})$, which has output-layer weights $\hat{W}_o$ and activation functions $\sigma_o(\cdot)$. Signal $v_o(t)$ is a certain observer robustifying term, and the observer gains $k_P$, $k_D$, $k_{P2}$ are positive design constants detailed in Ref. 31.
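To show how these observer equations fit together, here is a minimal sketch of one forward-Euler integration step. The sigmoid basis, the gains, and the Euler discretization are assumed choices, and the stability-guaranteeing weight-tuning laws of Ref. 31 are not reproduced here.

```python
import numpy as np

def sigma_o(x_hat, n_hidden=10, seed=0):
    """Assumed activation basis: fixed random-projection sigmoids."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n_hidden, x_hat.size))
    return 1.0 / (1.0 + np.exp(-V @ x_hat))

def observer_step(x1, tau, x1_hat, z2_hat, W_hat, M, v_o, kD, kP, kP2, dt):
    """One forward-Euler step of the dynamic NN observer.

    x1 is the measured joint position q(t); W_hat holds the observer
    NN output-layer weights (tuning laws per Ref. 31, omitted here).
    """
    x1_tilde = x1 - x1_hat                    # measurable estimation error
    x2_hat = z2_hat + kP2 * x1_tilde          # velocity estimate
    x_hat = np.concatenate([x1_hat, x2_hat])  # full state estimate
    # Observer dynamics:
    x1_hat_new = x1_hat + dt * (x2_hat + kD * x1_tilde)
    rhs = tau - W_hat.T @ sigma_o(x_hat) + kP * x1_tilde + v_o
    z2_hat_new = z2_hat + dt * np.linalg.solve(M(x1), rhs)
    return x1_hat_new, z2_hat_new, x2_hat
```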
The NN output feedback tracking controller shown in Fig. 14 uses the dynamic NN observer to reconstruct the missing measurements $x_2(t) = \dot{q}(t)$ and then employs a second static NN for tracking control, exactly as in Fig. 4. Note that the outer tracking PD loop structure has been retained, but an additional dynamic NN loop is needed. In Ref. 31, weight-tuning algorithms that guarantee stability are given for both the dynamic estimator NN and the static control NN.
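A compact sketch of how the pieces of Fig. 14 might connect is given below: the observer's velocity estimate replaces the unmeasured $\dot{q}$ in an outer PD (filtered-error) loop, and a second static NN supplies the compensation term, in the spirit of Fig. 4. The gain matrices Kv and Lam and the control NN (Wc_hat, sigma_c) are hypothetical placeholders, not the controller of Ref. 31.

```python
import numpy as np

def control_torque(q_d, qd_d, x1, x2_hat, Wc_hat, sigma_c, Kv, Lam):
    """Output-feedback tracking law in the spirit of Fig. 14 (a sketch).

    q_d, qd_d: desired position and velocity; x2_hat comes from the
    observer; Wc_hat / sigma_c are the static control NN (assumed names).
    """
    e = q_d - x1                  # position tracking error
    e_dot = qd_d - x2_hat         # velocity error via the observer
    r = e_dot + Lam @ e           # filtered tracking error (PD structure)
    return Wc_hat.T @ sigma_c(np.concatenate([e, r])) + Kv @ r
```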
8 REINFORCEMENT LEARNING CONTROL USING NNs
Reinforcement learning techniques are based on psychological precepts of reward and punishment as used by I. P. Pavlov in the training of dogs at the turn of the twentieth century. The key tenet here is that the performance indicators of the controlled system should be simple, for instance, +1 for a successful trial and −1 for a failure, and that these simple signals should tune or adapt a NN controller so that its performance improves over time. This gives a learning feature driven by the basic success or failure record of the controlled system. Reinforcement learning has been studied by many researchers, including Refs. 32 and 33.
It is difficult to provide rigorous designs and analysis for reinforcement learning in the framework of standard control system theory, since the reinforcement signal carries reduced information, which makes analysis, including Lyapunov techniques, very complicated. Reinforcement learning is related to the so-called sign error tuning in adaptive control (Ref. 34), which has not been proven to yield stability.
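For comparison, a minimal sketch of sign error tuning is given below: the weight update sees only the sign of the filtered tracking error, which is the reduced, ±1-like information that links it to reinforcement learning. The gain gamma and the basis vector sigma_x are assumed, and, as noted above, this rule has not been proven to yield stability.

```python
import numpy as np

def sign_error_update(W_hat, sigma_x, r, gamma, dt):
    """One step of sign error tuning (illustrative, not proven stable).

    Only sgn(r) enters the update, mirroring the reduced +1/-1
    information of a reinforcement signal; gamma and sigma_x (the
    current activation vector) are assumed placeholders.
    """
    return W_hat + dt * gamma * np.outer(sigma_x, np.sign(r))
```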

