
(a) $g(\overline{R}) < g(R)$, or
(b) $g(\overline{R}) = g(R)$ and $w_i(\overline{R}) \leq w_i(R)$ for all $i \in I$, with strict inequality for at least one state $i$.

That is, each iteration either reduces the cost rate or else reduces the relative value of a (transient) state. Since the number of possible stationary policies is finite, this assertion implies that the algorithm converges after finitely many iterations. To prove the assertion, the starting point is the relation

$$c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, w_j(R) \leq w_i(R), \qquad i \in I, \tag{6.7.1}$$

with strict inequality only for those states $i$ with $\overline{R}_i \neq R_i$. This relation is an immediate consequence of the construction of policy $\overline{R}$. By Theorem 6.2.1 and (6.7.1), we have $g(\overline{R}) \leq g(R)$. The strict inequality $g(\overline{R}) < g(R)$ holds only if the strict inequality holds in (6.7.1) for some state $i$ that is recurrent under the new policy $\overline{R}$.
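
For readers who wish to see the test quantity in (6.7.1) and the tie-breaking convention of the policy-improvement step in executable form, the following Python sketch may be helpful. It is not the book's code: the data layout (`c[i, a]` for the one-step costs, `P[i, a, j]` for the transition probabilities), the function names, the choice of the last state as reference state and the numerical tolerance are all illustrative assumptions, and the value-determination step is written for the unichain case assumed in this section.

```python
import numpy as np

def evaluate_policy(R, c, P, ref):
    """Value-determination step (sketch): solve
       w_i = c_i(R_i) - g + sum_j p_ij(R_i) w_j  for all i,  with w_ref = 0,
    for the average cost g and the relative values w of a unichain policy R."""
    N = len(R)
    A = np.zeros((N + 1, N + 1))
    b = np.zeros(N + 1)
    for i in range(N):
        A[i, :N] = -P[i, R[i], :]      # -sum_j p_ij(R_i) w_j
        A[i, i] += 1.0                 # +w_i
        A[i, N] = 1.0                  # +g
        b[i] = c[i, R[i]]              # c_i(R_i)
    A[N, ref] = 1.0                    # normalization w_ref = 0
    x = np.linalg.solve(A, b)
    return x[N], x[:N]                 # g(R), w(R)

def improve_policy(R, c, P, g, w, tol=1e-12):
    """Policy-improvement step with the tie-breaking convention:
    keep the old action R_i unless some action is strictly better."""
    R_bar = R.copy()
    for i in range(len(R)):
        test = c[i, :] - g + P[i, :, :] @ w   # test quantity of (6.7.1), per action
        if test.min() < test[R[i]] - tol:     # change the action only on strict improvement
            R_bar[i] = int(np.argmin(test))
    return R_bar

if __name__ == "__main__":
    # Tiny illustrative MDP with arbitrary numbers: 3 states, 2 actions.
    rng = np.random.default_rng(0)
    N, K = 3, 2
    P = rng.random((N, K, N))
    P /= P.sum(axis=2, keepdims=True)  # make each P[i, a, :] a probability vector
    c = rng.random((N, K))
    R = np.zeros(N, dtype=int)
    while True:                        # terminates after finitely many iterations
        g, w = evaluate_policy(R, c, P, ref=N - 1)
        R_bar = improve_policy(R, c, P, g, w)
        if np.array_equal(R_bar, R):
            break
        R = R_bar
    print("average cost:", g, "  policy:", R)
```

In `improve_policy` the baseline `test[R[i]]` equals $w_i(R)$ by the value-determination equations, so the action in state $i$ is changed only when a strictly smaller value of the test quantity is found; this is exactly why strict inequality can occur in (6.7.1) only at states with $\overline{R}_i \neq R_i$.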
Consider now the case of $g(\overline{R}) = g(R)$. Then it is true that the equality sign holds in (6.7.1) for all $i \in I(\overline{R})$. Thus, by the convention made in the policy-improvement step,

$$\overline{R}_i = R_i, \qquad i \in I(\overline{R}). \tag{6.7.2}$$

This implies that

$$I(\overline{R}) = I(R), \tag{6.7.3}$$

since the set $I(\overline{R})$ is closed under policy $R$ and any two states in $I(\overline{R})$ communicate under policy $\overline{R}$. In its turn (6.7.3) implies that

$$w_j(\overline{R}) = w_j(R), \qquad j \in I(R). \tag{6.7.4}$$

This can be seen as follows. From the definition (6.3.1) of the relative values and the fact that the set of recurrent states is a closed set, it follows that for any policy $Q$ the relative values for the recurrent states $i \in I(Q)$ do not depend on the actions in the transient states $i \notin I(Q)$. In view of the convention to take the largest state in $I(Q)$ as the reference state for the definition of the relative value function $w_i(Q)$, it follows from (6.7.2) and (6.7.3) that (6.7.4) holds. The remainder of the proof is now easy. Proceeding in the same way as in the derivation of (6.3.3), we find by iterating the inequality (6.7.1) that


$$
\begin{aligned}
w_i(R) &\geq c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, w_j(R) \\
       &\geq V_m(i, \overline{R}) - m\, g(R) + \sum_{j \in I} p^{(m)}_{ij}(\overline{R})\, w_j(R), \qquad i \in I \text{ and } m \geq 1.
\end{aligned}
\tag{6.7.5}
$$
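
To see how the second inequality in (6.7.5) arises, it may help to write out a single substitution step. Applying (6.7.1) once more to each term $w_j(R)$ on the right-hand side gives, for $m = 2$,

$$
\begin{aligned}
w_i(R) &\geq c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\Bigl[c_j(\overline{R}_j) - g(R) + \sum_{k \in I} p_{jk}(\overline{R}_j)\, w_k(R)\Bigr] \\
       &= V_2(i, \overline{R}) - 2\, g(R) + \sum_{k \in I} p^{(2)}_{ik}(\overline{R})\, w_k(R),
\end{aligned}
$$

using $\sum_{j \in I} p_{ij}(\overline{R}_i) = 1$, $V_2(i, \overline{R}) = c_i(\overline{R}_i) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, c_j(\overline{R}_j)$ and $p^{(2)}_{ik}(\overline{R}) = \sum_{j \in I} p_{ij}(\overline{R}_i)\, p_{jk}(\overline{R}_j)$. Repeating the substitution $m-1$ times yields (6.7.5) for general $m$.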