(a) $g(\overline{R}) < g(R)$, or
(b) $g(\overline{R}) = g(R)$ and $w_i(\overline{R}) \le w_i(R)$ for all $i \in I$, with strict inequality for at least one state $i$.
That is, each iteration either reduces the cost rate or else reduces the relative value
of a (transient) state. Since the number of possible stationary policies is finite, this
assertion implies that the algorithm converges after finitely many iterations. To
prove the assertion, the starting point is the relation
$$
c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, w_j(R) \le w_i(R), \qquad i \in I, \tag{6.7.1}
$$
with strict inequality only for those states $i$ with $\overline{R}_i \neq R_i$. This relation is an
immediate consequence of the construction of the new policy $\overline{R}$. By Theorem 6.2.1 and
(6.7.1), we have $g(\overline{R}) \le g(R)$. The strict inequality $g(\overline{R}) < g(R)$ holds only if the
strict inequality holds in (6.7.1) for some state $i$ that is recurrent under the new
policy $\overline{R}$.
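To make relation (6.7.1) and the convention used in the policy-improvement step concrete, here is a minimal sketch in Python; the data layout (cost array `c[i][a]`, transition array `P[i][a]`), the function name and the tolerance are illustrative assumptions, not the book's notation.

```python
import numpy as np

def policy_improvement(c, P, g, w, R):
    """One policy-improvement step (illustrative sketch).

    c[i][a] : one-step cost of action a in state i            (assumed layout)
    P[i][a] : transition probabilities p_ij(a), a vector over j
    g, w    : average cost g(R) and relative values w_i(R) of the current policy R
    R       : current stationary policy, R[i] = action chosen in state i
    Returns a new policy R_bar satisfying relation (6.7.1).
    """
    R_bar = list(R)
    for i in range(len(R)):
        # policy-improvement quantity  c_i(a) - g(R) + sum_j p_ij(a) w_j(R)
        test = [c[i][a] - g + np.dot(P[i][a], w) for a in range(len(c[i]))]
        best = int(np.argmin(test))
        # convention: change the action in state i only when it yields a
        # strict improvement; otherwise keep the old action R[i]
        if test[best] < test[R[i]] - 1e-12:
            R_bar[i] = best
    return R_bar
```

Since $w_i(R)$ equals the test quantity of the old action $R_i$ (by the value-determination equations for $R$), the policy produced this way satisfies (6.7.1), with strict inequality exactly in the states where the action is changed.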
Consider now the case of $g(\overline{R}) = g(R)$. Then the equality sign holds
in (6.7.1) for all $i \in I(\overline{R})$. Thus, by the convention made in the policy-improvement
step,

$$
\overline{R}_i = R_i, \qquad i \in I(\overline{R}). \tag{6.7.2}
$$

This implies that

$$
I(\overline{R}) = I(R), \tag{6.7.3}
$$

since the set $I(\overline{R})$ is closed under policy $R$ and any two states in $I(\overline{R})$ communicate
under policy $R$. In turn, (6.7.3) implies that

$$
w_j(\overline{R}) = w_j(R), \qquad j \in I(R). \tag{6.7.4}
$$
This can be seen as follows. From the definition (6.3.1) of the relative values and
the fact that the set of recurrent states is a closed set, it follows that for any policy
$Q$ the relative values for the recurrent states $i \in I(Q)$ do not depend on the actions
in the transient states $i \notin I(Q)$. In view of the convention to take the largest state in
$I(Q)$ as the reference state for the definition of the relative value function $w_i(Q)$,
it follows from (6.7.2) and (6.7.3) that (6.7.4) holds.
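As a concrete illustration of this point, the sketch below computes $g$ and the relative values of a fixed policy by solving the value-determination equations with the reference-state value normalized to zero; this linear-equation route is a standard way to obtain the relative values consistent with fixing $w_r = 0$ at the reference state $r$ (it is not the book's definition (6.3.1) itself), and the function name and data layout are again assumptions made for the example.

```python
import numpy as np

def average_cost_and_relative_values(c, P, ref):
    """Value determination for a fixed stationary policy (illustrative sketch).

    c[i] : one-step cost in state i under the policy          (assumed layout)
    P[i] : row i of the transition matrix under the policy
    ref  : reference state, taken recurrent (in the book: the largest state
           of I(Q)); its relative value is normalized to zero.
    Solves  w_i = c_i - g + sum_j p_ij w_j  together with  w_ref = 0.
    """
    n = len(c)
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[:n, :n] = np.eye(n) - np.asarray(P)
    A[:n, n] = 1.0               # coefficient of the unknown g
    A[n, ref] = 1.0              # normalization w_ref = 0
    b[:n] = np.asarray(c)
    x = np.linalg.solve(A, b)
    return x[n], x[:n]           # g and (w_0, ..., w_{n-1})
```

Because $I(Q)$ is closed and contains the reference state, the equations for $i \in I(Q)$ involve only the values $w_j$ with $j \in I(Q)$; altering the costs or transition probabilities in the transient states therefore leaves $g$ and the recurrent-state values unchanged, which is exactly the property used above to conclude (6.7.4).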
The remainder of the proof is now easy. Proceeding in the same way as in the
derivation of (6.3.3), we find by iterating the inequality (6.7.1) that

$$
\begin{aligned}
w_i(R) &\ge c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, w_j(R) \\
       &\ge V_m(i, \overline{R}) - m\, g(R) + \sum_{j \in I} p_{ij}^{(m)}(\overline{R})\, w_j(R), \qquad i \in I \text{ and } m \ge 1,
\end{aligned} \tag{6.7.5}
$$
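For instance, for $m = 2$ the iteration amounts to substituting (6.7.1) once for $w_j(R)$ on the right-hand side:

$$
w_i(R) \ge c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\Bigl[c_j(\overline{R}_j) - g(R) + \sum_{k \in I} p_{jk}(\overline{R}_j)\, w_k(R)\Bigr]
       = V_2(i, \overline{R}) - 2\, g(R) + \sum_{k \in I} p_{ik}^{(2)}(\overline{R})\, w_k(R),
$$

using $\sum_{j \in I} p_{ij}(\overline{R}_i) = 1$, $V_2(i,\overline{R}) = c_i(\overline{R}_i) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, c_j(\overline{R}_j)$ and $p^{(2)}_{ik}(\overline{R}) = \sum_{j \in I} p_{ij}(\overline{R}_i)\, p_{jk}(\overline{R}_j)$; repeating the substitution $m-1$ times yields the general bound in (6.7.5).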