Page 251 - A First Course In Stochastic Models

P. 251

244 DISCRETE-TIME MARKOV DECISION PROCESSES

Note, as a consequence of (6.3.1), the normalization

w r (R) = 0.
In accordance with the heuristic result (6.2.9), the next theorem shows that the
average cost g = g(R) and the relative values υ i = w i (R), i ∈ I satisfy a system
of linear equations.

Theorem 6.3.1 Let R be a given stationary policy such that the associated Markov
chain {X n } has no two disjoint closed sets. Then

(a) The average cost g(R) and the relative values w i (R), i ∈ I, satisfy the following
system of linear equations in the unknowns g and υ i , i ∈ I:

υ i = c i (R i ) − g + p ij (R i )υ j , i ∈ I. (6.3.2)
j∈I

(b) Let the numbers g and υ i , i ∈ I, be any solution to (6.3.2). Then

g = g(R)

and, for some constant c,
υ i = w i (R) + c, i ∈ I.

(c) Let s be an arbitrarily chosen state. Then the linear equations (6.3.2) together
with the normalization equation υ s = 0 have a unique solution.

Proof (a) By conditioning on the next state following the initial state i, it can be
seen that

T i (R) = 1 + p ij (R i )T j (R), i ∈ I,
j =r

K i (R) = c i (R i ) + p ij (R i )K j (R), i ∈ I.
j =r

This implies that

K i (R) − g(R)T i (R) = c i (R i ) − g(R) + p ij (R i ){K j (R) − g(R)T j (R)}.
j =r
Hence, by w r (R) = 0, we ﬁnd

w i (R) = c i (R i ) − g(R) + p ij (R i )w j (R), i ∈ I.
j∈I

246 247 248 249 250 251 252 253 254 255 256