Page 276 - A First Course In Stochastic Models

P. 276

CONVERGENCE PROOFS 269

where the strict inequality sign holds in the ﬁrst inequality for each i with R i = R i .
By (6.3.5) with R replaced by R and the fact that g(R) = g(R), we have for any
m ≥ 1 that
(m)
w i (R) = V m (i, R) − mg(R) + p (R)w j (R), i ∈ I.
ij
j∈I
Replacing w j (R) by w j (R) − {w j (R) − w j (R)}, we next ﬁnd that
(m)

V m (i, R) − mg(R) + p (R)w j (R)
ij
j∈I
(m)
= w i (R) + p (R){w j (R) − w j (R)}, i ∈ I and m ≥ 1.
ij
j∈I
Hence (6.7.5) can be rewritten as

w i (R) ≥ c i (R i ) − g(R) + p ij (R i )w j (R)
j∈I
(m)
≥ w i (R) + p (R){w j (R) − w j (R)}, i ∈ I and m ≥ 1,
ij
j∈I
where the strict inequality sign holds in the ﬁrst inequality for each i with R i = R i .
(m)
Using (6.7.4) and noting that p (R) → 0 as m → ∞ for j transient under R,
ij
it follows that w i (R) ≥ w i (R) for all i ∈ I with strict inequality for each i with
R i = R i . This completes the proof.

Convergence proof for value iteration
The proof of Theorem 6.6.2 is only given for the special case that the following
assumption is satisﬁed.

Strong aperiodicity assumption (i) for each stationary policy R the associated
Markov chain {X n } has no two disjoint closed sets;
(ii) p ii (a) > 0 for all i ∈ I and a ∈ A(i).
Note that assumption (ii) automatically holds when the data transformation from
Section 6.6 is applied to the original model.
We ﬁrst establish an important lemma about the chain structure of the product of
Markov matrices associated with the stationary policies. In this lemma the notation
P (f ) is used for the stochastic matrix (p ij (f (i))), i, j ∈ I associated with the
stationary policy f . The (i, j)th element of the matrix product P Q is denoted by
(P Q) ij .

Lemma 6.7.1 Suppose that the strong aperiodicity assumption holds. Let N be
the number of states of the Markov decision model. Then, for any two N-tuples

271 272 273 274 275 276 277 278 279 280 281