Page 277 - A First Course In Stochastic Models

270             DISCRETE-TIME MARKOV DECISION PROCESSES

                (f_N, ..., f_1) and (g_N, ..., g_1) of stationary policies and for any two states r and
                s, there is some state j such that

                       [P(f_N) ··· P(f_1)]_{rj} > 0  and  [P(g_N) ··· P(g_1)]_{sj} > 0.   (6.7.6)

                Proof  Define for k = 1, . . . , N the sets S(k) and T (k) by

                                S(k) = {j ∈ I | [P(f_k) ··· P(f_1)]_{rj} > 0},
                                T(k) = {j ∈ I | [P(g_k) ··· P(g_1)]_{sj} > 0}.

                Since p_{ii}(a) > 0 for all i ∈ I and a ∈ A(i), we have

                       S(k + 1) ⊇ S(k)  and  T(k + 1) ⊇ T(k),   k = 1, ..., N − 1.

                Assume now to the contrary that (6.7.6) does not hold. Then S(N) ∩ T(N) is empty.
                In other words, S(N) and T(N) are disjoint sets with S(N) ≠ I and T(N) ≠ I.
                Thus, since the sets S(k) and T(k) are non-decreasing and S(N) ≠ I, there are
                integers v and w with 1 ≤ v, w < N such that S(v) = S(v + 1) and T(w) = T(w + 1)
                (if every inclusion were strict, S(N) would contain all N states). This implies
                that the set S(v) of states is closed under policy f_{v+1} and the set T(w) of
                states is closed under policy g_{w+1}. Since the sets S(N) and T(N) are disjoint
                and S(N) ⊇ S(v) and T(N) ⊇ T(w), the sets S(v) and T(w) are disjoint as well.
                Construct now a stationary policy R with R_i = f_{v+1}(i) for i ∈ S(v) and
                R_i = g_{w+1}(i) for i ∈ T(w). Then policy R has the two disjoint closed sets S(v)
                and T(w). This contradicts the first part of the strong aperiodicity assumption.
                Hence the result (6.7.6) must hold.
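The lemma can be checked numerically on a toy example. The sketch below uses two made-up 3-state transition matrices P(f) and P(g) with strictly positive diagonal entries, as the strong aperiodicity assumption requires; it forms the N-fold products of (6.7.6) for the constant policy sequences f_N = ··· = f_1 = f and g_N = ··· = g_1 = g, and lists the states j reachable with positive probability from both r and s. The matrices and the choice of r, s are illustrative, not from the text.

```python
import numpy as np

# Transition matrices of two hypothetical stationary policies;
# every diagonal entry p_ii(·) is strictly positive.
P_f = np.array([[0.5, 0.5, 0.0],
                [0.0, 0.5, 0.5],
                [0.5, 0.0, 0.5]])
P_g = np.array([[0.6, 0.0, 0.4],
                [0.4, 0.6, 0.0],
                [0.0, 0.4, 0.6]])

N = 3                                   # number of states
A = np.linalg.matrix_power(P_f, N)      # P(f_N) ··· P(f_1) with f_k = f
B = np.linalg.matrix_power(P_g, N)      # P(g_N) ··· P(g_1) with g_k = g

r, s = 0, 1
# States j with [P(f)^N]_{rj} > 0 and [P(g)^N]_{sj} > 0, as in (6.7.6).
common = [j for j in range(N) if A[r, j] > 0 and B[s, j] > 0]
print(common)
```

Because the diagonals are positive, the reachable sets S(k) and T(k) only grow with k, so after N steps the two rows overlap in at least one state.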

                Proof of Theorem 6.6.2 (under the strong aperiodicity assumption)  We first
                introduce some notation. Let R(n) be any stationary policy for which the action
                R_i(n) minimizes the right-hand side of the value-iteration equation (6.6.1) for
                all i ∈ I. Denote by P_n the stochastic matrix whose (i, j)th element equals
                p_{ij}(R_i(n)) and define the vector V_n by V_n = (V_n(i), i ∈ I). By the proof
                of Theorem 6.6.3,

                   V_n − V_{n−1} ≤ P_{n−1}(V_{n−1} − V_{n−2})  and  V_n − V_{n−1} ≥ P_n(V_{n−1} − V_{n−2}).
                                                                             (6.6.7)

                Fix n ≥ 2. Since M_n = V_n(i_1) − V_{n−1}(i_1) and m_n = V_n(i_2) − V_{n−1}(i_2)
                for some states i_1 and i_2, we find

                       M_n − m_n ≤ [P_{n−1}(V_{n−1} − V_{n−2})](i_1) − [P_n(V_{n−1} − V_{n−2})](i_2).

                Applying the inequalities (6.6.7) repeatedly, we find for any 1 ≤ k < n

                         M_n − m_n ≤ [P_{n−1} P_{n−2} ··· P_{n−k}(V_{n−k} − V_{n−k−1})](i_1)
                                     − [P_n P_{n−1} ··· P_{n−k+1}(V_{n−k} − V_{n−k−1})](i_2).   (6.6.8)
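To see the bounds M_n and m_n at work, the following sketch runs the value-iteration recursion (6.6.1), i.e. V_n(i) = min_a {c_i(a) + Σ_j p_{ij}(a) V_{n−1}(j)}, on a hypothetical two-state, two-action MDP whose costs and transition probabilities are invented for illustration (with p_{ii}(a) > 0 throughout). The gap M_n − m_n shrinks geometrically, which is the convergence that the proof of Theorem 6.6.2 establishes.

```python
import numpy as np

# Hypothetical MDP (all numbers invented): costs[i, a] is the one-step
# cost c_i(a); probs[i, a, j] is the transition probability p_ij(a).
# Every diagonal entry p_ii(a) is positive (strong aperiodicity).
costs = np.array([[1.0, 2.0],
                  [3.0, 0.5]])
probs = np.array([[[0.7, 0.3], [0.5, 0.5]],
                  [[0.4, 0.6], [0.6, 0.4]]])

V = np.zeros(2)                                 # V_0 = 0
for n in range(1, 51):
    V_new = np.min(costs + probs @ V, axis=1)   # recursion (6.6.1)
    diff = V_new - V                            # V_n - V_{n-1}
    M_n, m_n = diff.max(), diff.min()           # the bounds M_n, m_n
    V = V_new

gap = M_n - m_n                                 # tends to 0 geometrically
print(gap)
```

After 50 iterations the gap is negligible, and M_n and m_n sandwich the minimal average cost per time unit, which is how value iteration is stopped in practice.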