Page 251 - A First Course In Stochastic Models
P. 251
244 DISCRETE-TIME MARKOV DECISION PROCESSES
Note, as a consequence of (6.3.1), the normalization
w r (R) = 0.
In accordance with the heuristic result (6.2.9), the next theorem shows that the
average cost g = g(R) and the relative values υ i = w i (R), i ∈ I satisfy a system
of linear equations.
Theorem 6.3.1 Let R be a given stationary policy such that the associated Markov
chain {X n } has no two disjoint closed sets. Then
(a) The average cost g(R) and the relative values w i (R), i ∈ I, satisfy the following
system of linear equations in the unknowns g and υ i , i ∈ I:
υ i = c i (R i ) − g + p ij (R i )υ j , i ∈ I. (6.3.2)
j∈I
(b) Let the numbers g and υ i , i ∈ I, be any solution to (6.3.2). Then
g = g(R)
and, for some constant c,
υ i = w i (R) + c, i ∈ I.
(c) Let s be an arbitrarily chosen state. Then the linear equations (6.3.2) together
with the normalization equation υ s = 0 have a unique solution.
Proof (a) By conditioning on the next state following the initial state i, it can be
seen that
T i (R) = 1 + p ij (R i )T j (R), i ∈ I,
j =r
K i (R) = c i (R i ) + p ij (R i )K j (R), i ∈ I.
j =r
This implies that
K i (R) − g(R)T i (R) = c i (R i ) − g(R) + p ij (R i ){K j (R) − g(R)T j (R)}.
j =r
Hence, by w r (R) = 0, we find
w i (R) = c i (R i ) − g(R) + p ij (R i )w j (R), i ∈ I.
j∈I