Page 276 - A First Course In Stochastic Models
P. 276
CONVERGENCE PROOFS 269
where the strict inequality sign holds in the first inequality for each i with R i = R i .
By (6.3.5) with R replaced by R and the fact that g(R) = g(R), we have for any
m ≥ 1 that
(m)
w i (R) = V m (i, R) − mg(R) + p (R)w j (R), i ∈ I.
ij
j∈I
Replacing w j (R) by w j (R) − {w j (R) − w j (R)}, we next find that
(m)
V m (i, R) − mg(R) + p (R)w j (R)
ij
j∈I
(m)
= w i (R) + p (R){w j (R) − w j (R)}, i ∈ I and m ≥ 1.
ij
j∈I
Hence (6.7.5) can be rewritten as
w i (R) ≥ c i (R i ) − g(R) + p ij (R i )w j (R)
j∈I
(m)
≥ w i (R) + p (R){w j (R) − w j (R)}, i ∈ I and m ≥ 1,
ij
j∈I
where the strict inequality sign holds in the first inequality for each i with R i = R i .
(m)
Using (6.7.4) and noting that p (R) → 0 as m → ∞ for j transient under R,
ij
it follows that w i (R) ≥ w i (R) for all i ∈ I with strict inequality for each i with
R i = R i . This completes the proof.
Convergence proof for value iteration
The proof of Theorem 6.6.2 is only given for the special case that the following
assumption is satisfied.
Strong aperiodicity assumption (i) for each stationary policy R the associated
Markov chain {X n } has no two disjoint closed sets;
(ii) p ii (a) > 0 for all i ∈ I and a ∈ A(i).
Note that assumption (ii) automatically holds when the data transformation from
Section 6.6 is applied to the original model.
We first establish an important lemma about the chain structure of the product of
Markov matrices associated with the stationary policies. In this lemma the notation
P (f ) is used for the stochastic matrix (p ij (f (i))), i, j ∈ I associated with the
stationary policy f . The (i, j)th element of the matrix product P Q is denoted by
(P Q) ij .
Lemma 6.7.1 Suppose that the strong aperiodicity assumption holds. Let N be
the number of states of the Markov decision model. Then, for any two N-tuples