Page 270 - A First Course In Stochastic Models

VALUE-ITERATION ALGORITHM

Theorem 6.6.3  In the standard value-iteration algorithm the lower and upper
bounds satisfy
$$m_{k+1} \ge m_k \quad\text{and}\quad M_{k+1} \le M_k \quad\text{for all } k \ge 1.$$

Proof  By the definition of policy R(n),
$$V_n(i) = c_i(R_i(n)) + \sum_{j \in I} p_{ij}(R_i(n))\, V_{n-1}(j), \qquad i \in I. \tag{6.6.8}$$

In the same way as (6.6.5) was obtained, we find for any policy R that
$$c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\, V_{n-1}(j) \ge V_n(i), \qquad i \in I. \tag{6.6.9}$$
Taking n = k in (6.6.8) and taking n = k + 1 and R = R(k) in (6.6.9) gives
$$V_{k+1}(i) - V_k(i) \le \sum_{j \in I} p_{ij}(R_i(k)) \{V_k(j) - V_{k-1}(j)\}, \qquad i \in I. \tag{6.6.10}$$

Similarly, by taking n = k + 1 in (6.6.8) and taking n = k and R = R(k + 1) in
(6.6.9), we find
$$V_{k+1}(i) - V_k(i) \ge \sum_{j \in I} p_{ij}(R_i(k+1)) \{V_k(j) - V_{k-1}(j)\}, \qquad i \in I. \tag{6.6.11}$$
Since $V_k(j) - V_{k-1}(j) \le M_k$ for all $j \in I$ and $\sum_{j \in I} p_{ij}(R_i(k)) = 1$, it follows
from (6.6.10) that $V_{k+1}(i) - V_k(i) \le M_k$ for all $i \in I$. This gives $M_{k+1} \le M_k$.
Similarly, we obtain from (6.6.11) that $m_{k+1} \ge m_k$.
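The monotone behaviour of the bounds in Theorem 6.6.3 can be checked numerically. The sketch below runs standard value iteration on a small hypothetical two-state, two-action cost-minimization MDP (the cost and transition data are invented purely for illustration) and records the bounds $m_k = \min_i \{V_k(i) - V_{k-1}(i)\}$ and $M_k = \max_i \{V_k(i) - V_{k-1}(i)\}$:

```python
import numpy as np

# Hypothetical data for illustration only:
# c[i, a] = one-step cost c_i(a); P[a, i, j] = transition probability p_ij(a).
c = np.array([[1.0, 4.0],
              [3.0, 2.0]])
P = np.array([[[0.5, 0.5],   # action 0
               [0.8, 0.2]],
              [[0.9, 0.1],   # action 1
               [0.3, 0.7]]])

def value_iteration(c, P, num_iter=50):
    """Standard value iteration; returns the bound sequences (m_k, M_k)."""
    n_states = c.shape[0]
    V_prev = np.zeros(n_states)
    m, M = [], []
    for _ in range(num_iter):
        # Q[i, a] = c_i(a) + sum_j p_ij(a) * V_{n-1}(j)
        Q = c + np.einsum('aij,j->ia', P, V_prev)
        V = Q.min(axis=1)          # minimizing action defines policy R(n)
        diff = V - V_prev
        m.append(diff.min())       # lower bound m_k
        M.append(diff.max())       # upper bound M_k
        V_prev = V
    return np.array(m), np.array(M)

m, M = value_iteration(c, P)
# Theorem 6.6.3: m_k is nondecreasing and M_k is nonincreasing.
print("m_k nondecreasing:", bool(np.all(np.diff(m) >= -1e-10)))
print("M_k nonincreasing:", bool(np.all(np.diff(M) <= 1e-10)))
```

Both sequences squeeze in on the minimal average cost, which is what makes $M_k - m_k$ a practical stopping criterion for the algorithm.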

                Data transformation

                The periodicity issue can be circumvented by a perturbation of the one-step transi-
                tion probabilities. The perturbation technique is based on the following two obser-
                vations. First, a recurrent state allowing for a direct transition to itself must be
                aperiodic. Second, the relative frequencies at which the states of a Markov chain
are visited do not change when the state changes are delayed by a constant factor
and the probability of a self-transition is accordingly enlarged. In other words, if
the one-step transition probabilities $p_{ij}$ of a Markov chain $\{X_n\}$ are perturbed as
$\tilde{p}_{ij} = \tau p_{ij}$ for $j \ne i$ and $\tilde{p}_{ii} = \tau p_{ii} + 1 - \tau$ for some constant $\tau$ with $0 < \tau < 1$,
the perturbed Markov chain $\{\tilde{X}_n\}$ with one-step transition probabilities $\tilde{p}_{ij}$ is
aperiodic and has the same equilibrium probabilities as the original Markov chain
$\{X_n\}$ (verify). Thus a Markov decision model involving periodicities may be perturbed
as follows. Choosing some constant $\tau$ with $0 < \tau < 1$, the state space, the action
sets, the one-step costs and the one-step transition probabilities of the perturbed
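The "(verify)" claim above is easy to confirm numerically. The following sketch, using a hypothetical two-state chain of period 2 as an example, applies the perturbation with $\tau = 0.5$ and compares the equilibrium distributions of the original and perturbed chains:

```python
import numpy as np

# Example chain of period 2: the two states alternate deterministically.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

def perturb(P, tau):
    """p~_ij = tau * p_ij for j != i, and p~_ii = tau * p_ii + 1 - tau."""
    n = P.shape[0]
    return tau * P + (1.0 - tau) * np.eye(n)

def stationary(P):
    """Equilibrium distribution: solve pi P = pi together with sum(pi) = 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

P_t = perturb(P, tau=0.5)
# The perturbed chain has positive diagonal entries, hence is aperiodic,
# and its equilibrium probabilities match those of the original chain.
print(stationary(P), stationary(P_t))
```

The perturbation trades slower per-step movement for a strictly positive self-transition probability, which removes the periodicity without disturbing the long-run state frequencies.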