

Step 3 (stopping test). If
$$
0 \le M_n - m_n \le \varepsilon m_n
$$
with $\varepsilon > 0$ a prespecified accuracy number (e.g. $\varepsilon = 10^{-3}$), stop with policy $R(n)$.

Step 4 (continuation). $n := n + 1$ and repeat step 1.
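
For concreteness, the following Python sketch assembles the whole loop, assuming the standard value-iteration recursion $V_n(i) = \min_a \{c_i(a) + \sum_j p_{ij}(a) V_{n-1}(j)\}$ and the bounds $m_n = \min_i [V_n(i) - V_{n-1}(i)]$ and $M_n = \max_i [V_n(i) - V_{n-1}(i)]$ from steps 1 and 2, which precede this excerpt; the function name and data layout are illustrative, not from the text.

```python
# Minimal sketch of the value-iteration loop (steps 1-4), assuming the recursion
# V_n(i) = min_a { c_i(a) + sum_j p_ij(a) V_{n-1}(j) } and the bounds
# m_n = min_i [V_n(i) - V_{n-1}(i)], M_n = max_i [V_n(i) - V_{n-1}(i)].
def value_iteration(c, p, eps=1e-3, max_iter=100_000):
    """c[i][a] is the one-step cost c_i(a); p[i][a][j] is p_ij(a).
    Returns the policy R(n) and the bounds (m_n, M_n) at the stopping iteration."""
    states = list(c)
    V_prev = {i: 0.0 for i in states}          # V_0(i) = 0
    for n in range(1, max_iter + 1):
        V, R = {}, {}
        for i in states:                       # step 1: compute V_n(i) and a minimizer
            values = {a: c[i][a] + sum(p[i][a].get(j, 0.0) * V_prev[j]
                                       for j in states)
                      for a in c[i]}
            R[i] = min(values, key=values.get)
            V[i] = values[R[i]]
        diffs = [V[i] - V_prev[i] for i in states]
        m_n, M_n = min(diffs), max(diffs)      # step 2: lower/upper bounds on g*
        if 0 <= M_n - m_n <= eps * m_n:        # step 3: stopping test
            return R, m_n, M_n
        V_prev = V                             # step 4: n := n + 1, repeat step 1
    raise RuntimeError("no convergence (the optimal chain may be periodic)")
```

Note that the relative stopping test presupposes $m_n > 0$; by the monotonicity of the sequence $\{m_n\}$ discussed at the end of this section, this holds as soon as all one-step costs are positive.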

By Theorem 6.6.1, we have
$$
0 \le \frac{g_i(R(n)) - g^*}{g^*} \le \frac{M_n - m_n}{m_n} \le \varepsilon, \qquad i \in I, \tag{6.6.7}
$$
when the algorithm is stopped after the $n$th iteration with policy $R(n)$. In other
words, the average cost of policy $R(n)$ cannot deviate more than $100\varepsilon\%$ from
the theoretically minimal average cost when the bounds $m_n$ and $M_n$ satisfy
$0 \le M_n - m_n \le \varepsilon m_n$. In practical applications one is usually satisfied with a policy
whose average cost is sufficiently close to the theoretically minimal average cost.
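
As an illustration (with made-up numbers, not from the text): if some iteration yields $m_n = 9.995$ and $M_n = 10.000$ while $\varepsilon = 10^{-3}$, then
$$
0 \le M_n - m_n = 0.005 \le \varepsilon m_n \approx 0.01,
$$
so the algorithm stops, and by (6.6.7) the average cost of $R(n)$ exceeds $g^*$ by at most $0.005/9.995 \approx 0.05\%$.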

                Convergence of the bounds

The remaining question is whether the lower and upper bounds $m_n$ and $M_n$ converge
to the same limit, so that the algorithm is stopped after finitely many iterations.
The answer is yes only if a certain aperiodicity condition is satisfied. In general
$m_n$ and $M_n$ need not have the same limit, as the following example demonstrates.
Consider the trivial Markov decision problem with two states 1 and 2 and a single
action $a_0$ in each state. The one-step costs and the one-step transition probabilities
are given by $c_1(a_0) = 1$, $c_2(a_0) = 0$, $p_{12}(a_0) = p_{21}(a_0) = 1$ and
$p_{11}(a_0) = p_{22}(a_0) = 0$. Then the system cycles between the states 1 and 2. It is easily
verified that $V_{2k}(1) = V_{2k}(2) = k$, $V_{2k-1}(1) = k$ and $V_{2k-1}(2) = k - 1$ for
all $k \ge 1$. Hence $m_n = 0$ and $M_n = 1$ for all $n$, implying that the sequences
$\{m_n\}$ and $\{M_n\}$ have different limits. The reason for the oscillating behaviour of
$V_n(i) - V_{n-1}(i)$ is the periodicity of the Markov chain describing the state of the
system; the sketch below reproduces this oscillation numerically.
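
The following illustrative code (not from the text) iterates $V_n = c + P V_{n-1}$ for this two-state chain starting from $V_0 = 0$ and prints $m_n$ and $M_n$, which indeed stay at 0 and 1 at every iteration.

```python
import numpy as np

# Periodic two-state example: c_1(a0) = 1, c_2(a0) = 0, and the system
# alternates deterministically between states 1 and 2.
c = np.array([1.0, 0.0])
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

V_prev = np.zeros(2)
for n in range(1, 9):
    V = c + P @ V_prev                  # only one action, so no minimization
    d = V - V_prev
    print(n, d.min(), d.max())          # prints m_n = 0.0, M_n = 1.0 for every n
    V_prev = V
```

The next theorem gives sufficient conditions for the convergence of the value-iteration algorithm.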
Theorem 6.6.2 Suppose that the weak unichain assumption holds and that for
each average cost optimal stationary policy the associated Markov chain $\{X_n\}$ is
aperiodic. Then there are finite constants $\alpha > 0$ and $0 < \beta < 1$ such that
$$
|M_n - m_n| \le \alpha \beta^n, \qquad n \ge 1.
$$
In particular, $\lim_{n \to \infty} M_n = \lim_{n \to \infty} m_n = g^*$.
  A proof of this deep theorem will not be given. A special case of the theorem
will be proved in Section 6.7. This special case is related to the data transformation
by which the periodicity issue can be circumvented. Before discussing this data
transformation, we prove the interesting result that the sequences $\{m_n\}$ and $\{M_n\}$
are always monotone, irrespective of the chain structure of the Markov chains.
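
As a preview (a sketch under the assumption that the transformation of Section 6.7 is the usual aperiodicity transformation $\bar p_{ij}(a) = \tau p_{ij}(a)$ for $j \ne i$ and $\bar p_{ii}(a) = \tau p_{ii}(a) + 1 - \tau$ with $0 < \tau < 1$, which leaves the stationary distribution, and hence the average cost, unchanged): applying it to the periodic example above restores geometric convergence of $M_n - m_n$.

```python
import numpy as np

tau = 0.8
c = np.array([1.0, 0.0])
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
P_bar = tau * P + (1.0 - tau) * np.eye(2)   # aperiodic; same stationary distribution

V_prev = np.zeros(2)
for n in range(1, 11):
    V = c + P_bar @ V_prev
    d = V - V_prev
    print(n, d.max() - d.min())             # M_n - m_n = 0.6**(n-1): geometric decay
    V_prev = V
```

Here both bounds converge to $g^* = 1/2$, the average cost of the original periodic chain, and $M_n - m_n = 0.6^{\,n-1}$, in line with Theorem 6.6.2 with $\beta = |1 - 2\tau|$ for this particular two-state chain.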