Intuitively, one might expect that the one-step difference $V_n(i) - V_{n-1}(i)$ will come very close to the minimal average cost per time unit and that the stationary policy whose actions minimize the right-hand side of (6.6.1) for all $i$ will be very close in cost to the minimal average cost. However, these matters appear to be rather subtle for the average cost criterion due to the effect of possible periodicities in the underlying decision processes. Before explaining this in more detail, we investigate an operator which is induced by the recursion equation (6.6.1). The operator $T$ adds to each function $v = (v_i,\, i \in I)$ a function $Tv$ whose $i$th component $(Tv)_i$ is defined by

$$(Tv)_i = \min_{a \in A(i)} \Big\{ c_i(a) + \sum_{j \in I} p_{ij}(a)\, v_j \Big\}, \qquad i \in I. \quad (6.6.2)$$
Note that $(Tv)_i = V_n(i)$ if $v_i = V_{n-1}(i)$, $i \in I$. The following theorem plays a key role in the value-iteration algorithm.
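To make the operator concrete, here is a minimal Python sketch of one application of $T$; the function name apply_T and the dictionary layout for the states, the action sets $A(i)$, the costs $c_i(a)$ and the transition probabilities $p_{ij}(a)$ are illustrative assumptions, not notation from the text.

```python
# A minimal sketch of the operator T in (6.6.2); the dictionary-based
# data layout below is an illustrative assumption.

def apply_T(v, states, actions, cost, prob):
    """Return Tv with (Tv)_i = min over a in A(i) of
    c_i(a) + sum over j in I of p_ij(a) * v_j.

    v[i]          -- current value v_i (e.g. V_{n-1}(i))
    actions[i]    -- the finite action set A(i)
    cost[i][a]    -- one-step cost c_i(a)
    prob[i][a][j] -- transition probability p_ij(a)
    """
    return {
        i: min(
            cost[i][a] + sum(prob[i][a][j] * v[j] for j in states)
            for a in actions[i]
        )
        for i in states
    }
```

Applying this map to $v = (V_{n-1}(i))$ returns $(V_n(i))$, so iterating it from the starting function $V_0$ reproduces the recursion (6.6.1).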
Theorem 6.6.1 Suppose that the weak unichain assumption is satisfied. Let $v = (v_i)$ be given. Define the stationary policy $R(v)$ as a policy which assigns to each state $i \in I$ an action $a = R_i(v)$ that minimizes the right-hand side of (6.6.2). Then

$$\min_{i \in I} \{(Tv)_i - v_i\} \le g^* \le g_s(R(v)) \le \max_{i \in I} \{(Tv)_i - v_i\} \quad (6.6.3)$$

for any $s \in I$, where $g^*$ is the minimal long-run average cost per time unit and $g_s(R(v))$ denotes the long-run average cost per time unit under policy $R(v)$ when the initial state is $s$.
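Theorem 6.6.1 is what turns value iteration into an algorithm with a computable stopping test: the quantities $m_n = \min_i\{V_n(i) - V_{n-1}(i)\}$ and $M_n = \max_i\{V_n(i) - V_{n-1}(i)\}$ bracket $g^*$, and the policy $R(V_{n-1})$ has average cost at most $M_n$. The sketch below wires these bounds into a loop; it reuses the hypothetical apply_T-style data layout from above, and the tolerance test is one common choice rather than a prescription of the text.

```python
# Sketch of value iteration driven by the bounds (6.6.3); eps and the
# starting function V_0 = 0 are illustrative choices.

def value_iteration(states, actions, cost, prob, eps=1e-6):
    v = {i: 0.0 for i in states}                     # V_0 = 0
    while True:
        policy, Tv = {}, {}
        for i in states:
            # (Tv)_i and a minimizing action R_i(v) from (6.6.2)
            vals = {a: cost[i][a] + sum(prob[i][a][j] * v[j] for j in states)
                    for a in actions[i]}
            policy[i] = min(vals, key=vals.get)
            Tv[i] = vals[policy[i]]
        m = min(Tv[i] - v[i] for i in states)        # lower bound on g*
        M = max(Tv[i] - v[i] for i in states)        # upper bound on g_s(R(v))
        if M - m <= eps * abs(m):                    # bounds nearly coincide
            return policy, m, M                      # m <= g* <= M
        v = Tv
```

On termination the returned policy $R(v)$ is, by (6.6.3), within $M - m$ of the minimal average cost for every initial state $s$.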
Proof To prove the first inequality, choose any stationary policy $R$. By the definition of $(Tv)_i$, we have for any state $i \in I$ that

$$(Tv)_i \le c_i(a) + \sum_{j \in I} p_{ij}(a)\, v_j, \qquad a \in A(i), \quad (6.6.4)$$

where the equality sign holds for $a = R_i(v)$. Choosing $a = R_i$ in (6.6.4) gives

$$(Tv)_i \le c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\, v_j, \qquad i \in I. \quad (6.6.5)$$
Define the lower bound

$$m = \min_{i \in I} \{(Tv)_i - v_i\}.$$

Since $m \le (Tv)_i - v_i$ for all $i$, it follows from (6.6.5) that $m + v_i \le c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\, v_j$ for all $i \in I$, and so

$$c_i(R_i) - m + \sum_{j \in I} p_{ij}(R_i)\, v_j \ge v_i, \qquad i \in I.$$