

Step 3 (stopping test). If
$$
0 \le M_n - m_n \le \varepsilon m_n
$$
with $\varepsilon > 0$ a prespecified accuracy number (e.g. $\varepsilon = 10^{-3}$), stop with policy $R(n)$.

Step 4 (continuation). $n := n + 1$ and repeat step 1.
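
For concreteness, the following Python sketch assembles the whole loop, assuming the standard value-iteration recursion $V_n(i) = \min_a \{c_i(a) + \sum_j p_{ij}(a) V_{n-1}(j)\}$ and the bounds $m_n = \min_i [V_n(i) - V_{n-1}(i)]$ and $M_n = \max_i [V_n(i) - V_{n-1}(i)]$ from steps 1 and 2, which precede this excerpt; the function name and data layout are illustrative, not from the text.

```python
# Minimal sketch of the value-iteration loop (steps 1-4), assuming the recursion
# V_n(i) = min_a { c_i(a) + sum_j p_ij(a) V_{n-1}(j) } and the bounds
# m_n = min_i [V_n(i) - V_{n-1}(i)], M_n = max_i [V_n(i) - V_{n-1}(i)].
def value_iteration(c, p, eps=1e-3, max_iter=100_000):
    """c[i][a] is the one-step cost c_i(a); p[i][a][j] is p_ij(a).
    Returns the policy R(n) and the bounds (m_n, M_n) at the stopping iteration."""
    states = list(c)
    V_prev = {i: 0.0 for i in states}          # V_0(i) = 0
    for n in range(1, max_iter + 1):
        V, R = {}, {}
        for i in states:                       # step 1: compute V_n(i) and a minimizer
            values = {a: c[i][a] + sum(p[i][a].get(j, 0.0) * V_prev[j]
                                       for j in states)
                      for a in c[i]}
            R[i] = min(values, key=values.get)
            V[i] = values[R[i]]
        diffs = [V[i] - V_prev[i] for i in states]
        m_n, M_n = min(diffs), max(diffs)      # step 2: lower/upper bounds on g*
        if 0 <= M_n - m_n <= eps * m_n:        # step 3: stopping test
            return R, m_n, M_n
        V_prev = V                             # step 4: n := n + 1, repeat step 1
    raise RuntimeError("no convergence (the optimal chain may be periodic)")
```

Note that the relative stopping test presupposes $m_n > 0$; by the monotonicity of the sequence $\{m_n\}$ discussed at the end of this section, this holds as soon as all one-step costs are positive.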

By Theorem 6.6.1, we have
$$
0 \le \frac{g_i(R(n)) - g^*}{g^*} \le \frac{M_n - m_n}{m_n} \le \varepsilon, \qquad i \in I, \tag{6.6.7}
$$
when the algorithm is stopped after the $n$th iteration with policy $R(n)$. In other
words, the average cost of policy $R(n)$ cannot deviate more than $100\varepsilon\%$ from
the theoretically minimal average cost when the bounds $m_n$ and $M_n$ satisfy
$0 \le M_n - m_n \le \varepsilon m_n$. In practical applications one is usually satisfied with a policy
whose average cost is sufficiently close to the theoretically minimal average cost.
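
As an illustration (with made-up numbers, not from the text): if some iteration yields $m_n = 9.995$ and $M_n = 10.000$ while $\varepsilon = 10^{-3}$, then
$$
0 \le M_n - m_n = 0.005 \le \varepsilon m_n \approx 0.01,
$$
so the algorithm stops, and by (6.6.7) the average cost of $R(n)$ exceeds $g^*$ by at most $0.005/9.995 \approx 0.05\%$.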

                Convergence of the bounds

The remaining question is whether the lower and upper bounds $m_n$ and $M_n$ converge
to the same limit, so that the algorithm is stopped after finitely many iterations.
The answer is yes only if a certain aperiodicity condition is satisfied. In general
$m_n$ and $M_n$ need not have the same limit, as the following example demonstrates.
Consider the trivial Markov decision problem with two states 1 and 2 and a single
action $a_0$ in each state. The one-step costs and the one-step transition probabilities
are given by $c_1(a_0) = 1$, $c_2(a_0) = 0$, $p_{12}(a_0) = p_{21}(a_0) = 1$ and
$p_{11}(a_0) = p_{22}(a_0) = 0$. Then the system cycles between the states 1 and 2. It is easily
verified that $V_{2k}(1) = V_{2k}(2) = k$, $V_{2k-1}(1) = k$ and $V_{2k-1}(2) = k - 1$ for
all $k \ge 1$. Hence $m_n = 0$ and $M_n = 1$ for all $n$, implying that the sequences
$\{m_n\}$ and $\{M_n\}$ have different limits. The reason for the oscillating behaviour of
$V_n(i) - V_{n-1}(i)$ is the periodicity of the Markov chain describing the state of the
system; the sketch below reproduces this oscillation numerically.
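
The following illustrative code (not from the text) iterates $V_n = c + P V_{n-1}$ for this two-state chain starting from $V_0 = 0$ and prints $m_n$ and $M_n$, which indeed stay at 0 and 1 at every iteration.

```python
import numpy as np

# Periodic two-state example: c_1(a0) = 1, c_2(a0) = 0, and the system
# alternates deterministically between states 1 and 2.
c = np.array([1.0, 0.0])
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

V_prev = np.zeros(2)
for n in range(1, 9):
    V = c + P @ V_prev                  # only one action, so no minimization
    d = V - V_prev
    print(n, d.min(), d.max())          # prints m_n = 0.0, M_n = 1.0 for every n
    V_prev = V
```

The next theorem gives sufficient conditions for the convergence of the value-iteration algorithm.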
Theorem 6.6.2 Suppose that the weak unichain assumption holds and that for
each average cost optimal stationary policy the associated Markov chain $\{X_n\}$ is
aperiodic. Then there are finite constants $\alpha > 0$ and $0 < \beta < 1$ such that
$$
|M_n - m_n| \le \alpha \beta^n, \qquad n \ge 1.
$$
In particular, $\lim_{n \to \infty} M_n = \lim_{n \to \infty} m_n = g^*$.
  A proof of this deep theorem will not be given. A special case of the theorem
will be proved in Section 6.7. This special case is related to the data transformation
by which the periodicity issue can be circumvented. Before discussing this data
transformation, we prove the interesting result that the sequences $\{m_n\}$ and $\{M_n\}$
are always monotone, irrespective of the chain structure of the Markov chains.
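
As a preview (a sketch under the assumption that the transformation of Section 6.7 is the usual aperiodicity transformation $\bar p_{ij}(a) = \tau p_{ij}(a)$ for $j \ne i$ and $\bar p_{ii}(a) = \tau p_{ii}(a) + 1 - \tau$ with $0 < \tau < 1$, which leaves the stationary distribution, and hence the average cost, unchanged): applying it to the periodic example above restores geometric convergence of $M_n - m_n$.

```python
import numpy as np

tau = 0.8
c = np.array([1.0, 0.0])
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
P_bar = tau * P + (1.0 - tau) * np.eye(2)   # aperiodic; same stationary distribution

V_prev = np.zeros(2)
for n in range(1, 11):
    V = c + P_bar @ V_prev
    d = V - V_prev
    print(n, d.max() - d.min())             # M_n - m_n = 0.6**(n-1): geometric decay
    V_prev = V
```

Here both bounds converge to $g^* = 1/2$, the average cost of the original periodic chain, and $M_n - m_n = 0.6^{\,n-1}$, in line with Theorem 6.6.2 with $\beta = |1 - 2\tau|$ for this particular two-state chain.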