Page 276 - A First Course In Stochastic Models
P. 276

CONVERGENCE PROOFS                       269

                where the strict inequality sign holds in the first inequality for each i with R i  = R i .
                By (6.3.5) with R replaced by R and the fact that g(R) = g(R), we have for any
                m ≥ 1 that
                                                       (m)
                         w i (R) = V m (i, R) − mg(R) +  p  (R)w j (R),  i ∈ I.
                                                       ij
                                                   j∈I
                Replacing w j (R) by w j (R) − {w j (R) − w j (R)}, we next find that
                                          (m)

                    V m (i, R) − mg(R) +  p  (R)w j (R)
                                          ij
                                      j∈I
                                          (m)
                            = w i (R) +  p  (R){w j (R) − w j (R)},  i ∈ I and m ≥ 1.
                                          ij
                                     j∈I
                Hence (6.7.5) can be rewritten as

                     w i (R) ≥ c i (R i ) − g(R) +  p ij (R i )w j (R)
                                           j∈I
                                         (m)
                           ≥ w i (R) +  p  (R){w j (R) − w j (R)},  i ∈ I and m ≥ 1,
                                         ij
                                    j∈I
                where the strict inequality sign holds in the first inequality for each i with R i  = R i .
                                          (m)
                Using (6.7.4) and noting that p  (R) → 0 as m → ∞ for j transient under R,
                                          ij
                it follows that w i (R) ≥ w i (R) for all i ∈ I with strict inequality for each i with
                R i  = R i . This completes the proof.


                Convergence proof for value iteration
                The proof of Theorem 6.6.2 is only given for the special case that the following
                assumption is satisfied.

                Strong aperiodicity assumption  (i) for each stationary policy R the associated
                Markov chain {X n } has no two disjoint closed sets;
                  (ii) p ii (a) > 0 for all i ∈ I and a ∈ A(i).
                  Note that assumption (ii) automatically holds when the data transformation from
                Section 6.6 is applied to the original model.
                  We first establish an important lemma about the chain structure of the product of
                Markov matrices associated with the stationary policies. In this lemma the notation
                P (f ) is used for the stochastic matrix (p ij (f (i))), i, j ∈ I associated with the
                stationary policy f . The (i, j)th element of the matrix product P Q is denoted by
                (P Q) ij .

                Lemma 6.7.1 Suppose that the strong aperiodicity assumption holds. Let N be
                the number of states of the Markov decision model. Then, for any two N-tuples
   271   272   273   274   275   276   277   278   279   280   281