
(a) $g(\overline{R}) < g(R)$, or
(b) $g(\overline{R}) = g(R)$ and $w_i(\overline{R}) \leq w_i(R)$ for all $i \in I$, with strict inequality for at least one state $i$.

That is, each iteration either reduces the cost rate or else reduces the relative value of a (transient) state. Since the number of possible stationary policies is finite, this assertion implies that the algorithm converges after finitely many iterations. To prove the assertion, the starting point is the relation

$$c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, w_j(R) \leq w_i(R), \qquad i \in I, \tag{6.7.1}$$

with strict inequality only for those states $i$ with $\overline{R}_i \neq R_i$. This relation is an immediate consequence of the construction of policy $\overline{R}$. By Theorem 6.2.1 and (6.7.1), we have $g(\overline{R}) \leq g(R)$. The strict inequality $g(\overline{R}) < g(R)$ holds only if the strict inequality holds in (6.7.1) for some state $i$ that is recurrent under the new policy $\overline{R}$.
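
For readers who wish to see the test quantity in (6.7.1) and the tie-breaking convention of the policy-improvement step in executable form, the following Python sketch may be helpful. It is not the book's code: the data layout (`c[i, a]` for the one-step costs, `P[i, a, j]` for the transition probabilities), the function names, the choice of the last state as reference state and the numerical tolerance are all illustrative assumptions, and the value-determination step is written for the unichain case assumed in this section.

```python
import numpy as np

def evaluate_policy(R, c, P, ref):
    """Value-determination step (sketch): solve
       w_i = c_i(R_i) - g + sum_j p_ij(R_i) w_j  for all i,  with w_ref = 0,
    for the average cost g and the relative values w of a unichain policy R."""
    N = len(R)
    A = np.zeros((N + 1, N + 1))
    b = np.zeros(N + 1)
    for i in range(N):
        A[i, :N] = -P[i, R[i], :]      # -sum_j p_ij(R_i) w_j
        A[i, i] += 1.0                 # +w_i
        A[i, N] = 1.0                  # +g
        b[i] = c[i, R[i]]              # c_i(R_i)
    A[N, ref] = 1.0                    # normalization w_ref = 0
    x = np.linalg.solve(A, b)
    return x[N], x[:N]                 # g(R), w(R)

def improve_policy(R, c, P, g, w, tol=1e-12):
    """Policy-improvement step with the tie-breaking convention:
    keep the old action R_i unless some action is strictly better."""
    R_bar = R.copy()
    for i in range(len(R)):
        test = c[i, :] - g + P[i, :, :] @ w   # test quantity of (6.7.1), per action
        if test.min() < test[R[i]] - tol:     # change the action only on strict improvement
            R_bar[i] = int(np.argmin(test))
    return R_bar

if __name__ == "__main__":
    # Tiny illustrative MDP with arbitrary numbers: 3 states, 2 actions.
    rng = np.random.default_rng(0)
    N, K = 3, 2
    P = rng.random((N, K, N))
    P /= P.sum(axis=2, keepdims=True)  # make each P[i, a, :] a probability vector
    c = rng.random((N, K))
    R = np.zeros(N, dtype=int)
    while True:                        # terminates after finitely many iterations
        g, w = evaluate_policy(R, c, P, ref=N - 1)
        R_bar = improve_policy(R, c, P, g, w)
        if np.array_equal(R_bar, R):
            break
        R = R_bar
    print("average cost:", g, "  policy:", R)
```

In `improve_policy` the baseline `test[R[i]]` equals $w_i(R)$ by the value-determination equations, so the action in state $i$ is changed only when a strictly smaller value of the test quantity is found; this is exactly why strict inequality can occur in (6.7.1) only at states with $\overline{R}_i \neq R_i$.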
Consider now the case of $g(\overline{R}) = g(R)$. Then it is true that the equality sign holds in (6.7.1) for all $i \in I(\overline{R})$. Thus, by the convention made in the policy-improvement step,

$$\overline{R}_i = R_i, \qquad i \in I(\overline{R}). \tag{6.7.2}$$

This implies that

$$I(\overline{R}) = I(R), \tag{6.7.3}$$

since the set $I(\overline{R})$ is closed under policy $R$ and any two states in $I(\overline{R})$ communicate under policy $\overline{R}$. In its turn (6.7.3) implies that

$$w_j(\overline{R}) = w_j(R), \qquad j \in I(R). \tag{6.7.4}$$

This can be seen as follows. From the definition (6.3.1) of the relative values and the fact that the set of recurrent states is a closed set, it follows that for any policy $Q$ the relative values for the recurrent states $i \in I(Q)$ do not depend on the actions in the transient states $i \notin I(Q)$. In view of the convention to take the largest state in $I(Q)$ as the reference state for the definition of the relative value function $w_i(Q)$, it follows from (6.7.2) and (6.7.3) that (6.7.4) holds. The remainder of the proof is now easy. Proceeding in the same way as in the derivation of (6.3.3), we find by iterating the inequality (6.7.1) that


$$
\begin{aligned}
w_i(R) &\geq c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, w_j(R) \\
       &\geq V_m(i, \overline{R}) - m\, g(R) + \sum_{j \in I} p^{(m)}_{ij}(\overline{R})\, w_j(R), \qquad i \in I \text{ and } m \geq 1.
\end{aligned}
\tag{6.7.5}
$$
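
To see how the second inequality in (6.7.5) arises, it may help to write out a single substitution step. Applying (6.7.1) once more to each term $w_j(R)$ on the right-hand side gives, for $m = 2$,

$$
\begin{aligned}
w_i(R) &\geq c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\Bigl[c_j(\overline{R}_j) - g(R) + \sum_{k \in I} p_{jk}(\overline{R}_j)\, w_k(R)\Bigr] \\
       &= V_2(i, \overline{R}) - 2\, g(R) + \sum_{k \in I} p^{(2)}_{ik}(\overline{R})\, w_k(R),
\end{aligned}
$$

using $\sum_{j \in I} p_{ij}(\overline{R}_i) = 1$, $V_2(i, \overline{R}) = c_i(\overline{R}_i) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, c_j(\overline{R}_j)$ and $p^{(2)}_{ik}(\overline{R}) = \sum_{j \in I} p_{ij}(\overline{R}_i)\, p_{jk}(\overline{R}_j)$. Repeating the substitution $m-1$ times yields (6.7.5) for general $m$.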