Page 277 - A First Course In Stochastic Models

270             DISCRETE-TIME MARKOV DECISION PROCESSES

                (f_N, ..., f_1) and (g_N, ..., g_1) of stationary policies and for any two states r and
                s, there is some state j such that

                       [P(f_N) ··· P(f_1)]_{rj} > 0  and  [P(g_N) ··· P(g_1)]_{sj} > 0.   (6.7.6)

                Proof  Define for k = 1, . . . , N the sets S(k) and T (k) by

                                S(k) = {j ∈ I | [P(f_k) ··· P(f_1)]_{rj} > 0},
                                T(k) = {j ∈ I | [P(g_k) ··· P(g_1)]_{sj} > 0}.

                Since p_{ii}(a) > 0 for all i ∈ I and a ∈ A(i), we have

                       S(k + 1) ⊇ S(k)  and  T(k + 1) ⊇ T(k),   k = 1, ..., N − 1.

                Assume now to the contrary that (6.7.6) does not hold. Then S(N) ∩ T(N) is empty.
                In other words, S(N) and T(N) are disjoint sets with S(N) ≠ I and T(N) ≠ I.
                Thus, since the sets S(k) and T(k) are non-decreasing and S(N) ≠ I, there are
                integers v and w with 1 ≤ v, w < N such that S(v) = S(v + 1) and T(w) = T(w + 1)
                (if every inclusion were strict, S(N) would contain all N states). This implies
                that the set S(v) of states is closed under policy f_{v+1} and the set T(w) of
                states is closed under policy g_{w+1}. Since the sets S(N) and T(N) are disjoint
                and S(N) ⊇ S(v) and T(N) ⊇ T(w), the sets S(v) and T(w) are disjoint as well.
                Construct now a stationary policy R with R_i = f_{v+1}(i) for i ∈ S(v) and
                R_i = g_{w+1}(i) for i ∈ T(w). Then policy R has the two disjoint closed sets S(v)
                and T(w). This contradicts the first part of the strong aperiodicity assumption.
                Hence the result (6.7.6) must hold.
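The lemma can be checked numerically on a toy example. The sketch below uses two made-up 3-state transition matrices P(f) and P(g) with strictly positive diagonal entries, as the strong aperiodicity assumption requires; it forms the N-fold products of (6.7.6) for the constant policy sequences f_N = ··· = f_1 = f and g_N = ··· = g_1 = g, and lists the states j reachable with positive probability from both r and s. The matrices and the choice of r, s are illustrative, not from the text.

```python
import numpy as np

# Transition matrices of two hypothetical stationary policies;
# every diagonal entry p_ii(·) is strictly positive.
P_f = np.array([[0.5, 0.5, 0.0],
                [0.0, 0.5, 0.5],
                [0.5, 0.0, 0.5]])
P_g = np.array([[0.6, 0.0, 0.4],
                [0.4, 0.6, 0.0],
                [0.0, 0.4, 0.6]])

N = 3                                   # number of states
A = np.linalg.matrix_power(P_f, N)      # P(f_N) ··· P(f_1) with f_k = f
B = np.linalg.matrix_power(P_g, N)      # P(g_N) ··· P(g_1) with g_k = g

r, s = 0, 1
# States j with [P(f)^N]_{rj} > 0 and [P(g)^N]_{sj} > 0, as in (6.7.6).
common = [j for j in range(N) if A[r, j] > 0 and B[s, j] > 0]
print(common)
```

Because the diagonals are positive, the reachable sets S(k) and T(k) only grow with k, so after N steps the two rows overlap in at least one state.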

                Proof of Theorem 6.6.2 (under the strong aperiodicity assumption)  We first
                introduce some notation. Let R(n) be any stationary policy for which the action
                R_i(n) minimizes the right-hand side of the value-iteration equation (6.6.1) for
                all i ∈ I. Denote by P_n the stochastic matrix whose (i, j)th element equals
                p_{ij}(R_i(n)) and define the vector V_n by V_n = (V_n(i), i ∈ I). By the proof
                of Theorem 6.6.3,

                   V_n − V_{n−1} ≤ P_{n−1}(V_{n−1} − V_{n−2})  and  V_n − V_{n−1} ≥ P_n(V_{n−1} − V_{n−2}).
                                                                             (6.6.7)

                Fix n ≥ 2. Since M_n = V_n(i_1) − V_{n−1}(i_1) and m_n = V_n(i_2) − V_{n−1}(i_2)
                for some states i_1 and i_2, we find

                       M_n − m_n ≤ [P_{n−1}(V_{n−1} − V_{n−2})](i_1) − [P_n(V_{n−1} − V_{n−2})](i_2).

                Applying the inequalities (6.6.7) repeatedly, we find for any 1 ≤ k < n

                         M_n − m_n ≤ [P_{n−1} P_{n−2} ··· P_{n−k}(V_{n−k} − V_{n−k−1})](i_1)
                                     − [P_n P_{n−1} ··· P_{n−k+1}(V_{n−k} − V_{n−k−1})](i_2).   (6.6.8)
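To see the bounds M_n and m_n at work, the following sketch runs the value-iteration recursion (6.6.1), i.e. V_n(i) = min_a {c_i(a) + Σ_j p_{ij}(a) V_{n−1}(j)}, on a hypothetical two-state, two-action MDP whose costs and transition probabilities are invented for illustration (with p_{ii}(a) > 0 throughout). The gap M_n − m_n shrinks geometrically, which is the convergence that the proof of Theorem 6.6.2 establishes.

```python
import numpy as np

# Hypothetical MDP (all numbers invented): costs[i, a] is the one-step
# cost c_i(a); probs[i, a, j] is the transition probability p_ij(a).
# Every diagonal entry p_ii(a) is positive (strong aperiodicity).
costs = np.array([[1.0, 2.0],
                  [3.0, 0.5]])
probs = np.array([[[0.7, 0.3], [0.5, 0.5]],
                  [[0.4, 0.6], [0.6, 0.4]]])

V = np.zeros(2)                                 # V_0 = 0
for n in range(1, 51):
    V_new = np.min(costs + probs @ V, axis=1)   # recursion (6.6.1)
    diff = V_new - V                            # V_n - V_{n-1}
    M_n, m_n = diff.max(), diff.min()           # the bounds M_n, m_n
    V = V_new

gap = M_n - m_n                                 # tends to 0 geometrically
print(gap)
```

After 50 iterations the gap is negligible, and M_n and m_n sandwich the minimal average cost per time unit, which is how value iteration is stopped in practice.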