Step 3 (stopping test). If

$$0 \le M_n - m_n \le \varepsilon m_n$$

with $\varepsilon > 0$ a prespecified accuracy number (e.g. $\varepsilon = 10^{-3}$), stop with policy $R(n)$.

Step 4 (continuation). $n := n + 1$ and repeat step 1.
By Theorem 6.6.1, we have

$$0 \le \frac{g_i(R(n)) - g^*}{g^*} \le \frac{M_n - m_n}{m_n} \le \varepsilon, \qquad i \in I, \tag{6.6.7}$$

when the algorithm is stopped after the $n$th iteration with policy $R(n)$. In other words, the average cost of policy $R(n)$ cannot deviate more than $100\varepsilon\%$ from the theoretically minimal average cost when the bounds $m_n$ and $M_n$ satisfy $0 \le M_n - m_n \le \varepsilon m_n$. In practical applications one is usually satisfied with a policy whose average cost is sufficiently close to the theoretically minimal average cost.
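For concreteness, the following sketch implements the value-iteration recursion together with the bounds $m_n$ and $M_n$ and the stopping test of step 3. It assumes a finite model stored as a cost array `c[i, a]` and a transition array `p[i, a, j]`; the function name `value_iteration` and the safety cap `max_iter` are illustrative choices, not notation from the text.

```python
import numpy as np

def value_iteration(c, p, eps=1e-3, max_iter=100_000):
    """Value iteration with lower/upper bounds m_n, M_n on the minimal
    average cost g* and the stopping test 0 <= M_n - m_n <= eps * m_n.

    c : array of shape (num_states, num_actions), one-step costs c_i(a)
    p : array of shape (num_states, num_actions, num_states),
        one-step transition probabilities p_ij(a)
    """
    c = np.asarray(c, dtype=float)
    p = np.asarray(p, dtype=float)
    V_prev = np.zeros(c.shape[0])              # step 0: V_0(i) = 0 for all i
    for n in range(1, max_iter + 1):
        # step 1: V_n(i) = min_a { c_i(a) + sum_j p_ij(a) V_{n-1}(j) }
        Q = c + p @ V_prev                     # Q[i, a], shape (num_states, num_actions)
        policy = Q.argmin(axis=1)              # R(n): a minimizing action in each state
        V = Q.min(axis=1)
        # step 2: bounds m_n <= g* <= M_n
        m_n = np.min(V - V_prev)
        M_n = np.max(V - V_prev)
        # step 3: stopping test (assumes m_n > 0, as in (6.6.7))
        if 0 <= M_n - m_n <= eps * m_n:
            return policy, m_n, M_n
        # step 4: continuation
        V_prev = V
    raise RuntimeError("stopping test not met; see the aperiodicity discussion below")
```

By (6.6.7), the returned policy has an average cost within $100\varepsilon\%$ of $g^*$, and either bound (or, for instance, their average) can be used as an estimate of $g^*$.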
Convergence of the bounds
The remaining question is whether the lower and upper bounds $m_n$ and $M_n$ converge to the same limit, so that the algorithm will be stopped after finitely many iterations. The answer is yes only if a certain aperiodicity condition is satisfied. In general $m_n$ and $M_n$ need not have the same limit, as the following example demonstrates.
Consider the trivial Markov decision problem with two states 1 and 2 and a single action $a_0$ in each state. The one-step costs and the one-step transition probabilities are given by $c_1(a_0) = 1$, $c_2(a_0) = 0$, $p_{12}(a_0) = p_{21}(a_0) = 1$ and $p_{11}(a_0) = p_{22}(a_0) = 0$. Then the system cycles between the states 1 and 2. It is easily verified that $V_{2k}(1) = V_{2k}(2) = k$, $V_{2k-1}(1) = k$ and $V_{2k-1}(2) = k - 1$ for all $k \ge 1$. Hence $m_n = 0$ and $M_n = 1$ for all $n$, implying that the sequences
$\{m_n\}$ and $\{M_n\}$ have different limits. The reason for the oscillating behaviour of $V_n(i) - V_{n-1}(i)$ is the periodicity of the Markov chain describing the state of the system; a short numerical check of this oscillation is sketched below.
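As a quick numerical check (a sketch in the same illustrative style as above, not part of the text), iterating the recursion for this two-state example reproduces $m_n = 0$ and $M_n = 1$ for every $n$, so the stopping test of step 3 is never passed even though $g^* = \tfrac{1}{2}$:

```python
import numpy as np

# Two-state example: c_1(a_0) = 1, c_2(a_0) = 0 and the system alternates
# deterministically between the states (p_12(a_0) = p_21(a_0) = 1).
c = np.array([1.0, 0.0])
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])

V_prev = np.zeros(2)                       # V_0(i) = 0
for n in range(1, 9):
    V = c + P @ V_prev                     # only one action, so no minimization
    m_n, M_n = np.min(V - V_prev), np.max(V - V_prev)
    print(f"n = {n}: V_n = {V}, m_n = {m_n:.0f}, M_n = {M_n:.0f}")
    V_prev = V
# Every iteration prints m_n = 0 and M_n = 1: the bounds never close in on
# g* = 1/2 because the underlying Markov chain has period 2.
```

The next theorem gives sufficient conditions for the convergence of the value-iteration algorithm.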
Theorem 6.6.2 Suppose that the weak unichain assumption holds and that for each average cost optimal stationary policy the associated Markov chain $\{X_n\}$ is aperiodic. Then there are finite constants $\alpha > 0$ and $0 < \beta < 1$ such that

$$|M_n - m_n| \le \alpha \beta^n, \qquad n \ge 1.$$

In particular, $\lim_{n \to \infty} M_n = \lim_{n \to \infty} m_n = g^*$.
A proof of this deep theorem will not be given. A special case of the theorem
will be proved in Section 6.7. This special case is related to the data transformation
by which the periodicity issue can be circumvented. Before discussing this data
transformation, we prove the interesting result that the sequences $\{m_n\}$ and $\{M_n\}$ are always monotone, irrespective of the chain structure of the Markov chains.