
                An application of Theorem 6.2.1 now gives that

$$ g_i(R) \ge m, \qquad i \in I. $$

This inequality holds for each policy $R$, and so $g^* = \min_R g_i(R) \ge m$, proving the
first inequality in (6.6.3). The proof of the last inequality in (6.6.3) is very similar.
By the definition of the policy $R(v)$,


$$ (Tv)_i = c_i(R_i(v)) + \sum_{j \in I} p_{ij}(R_i(v))\, v_j, \qquad i \in I. \tag{6.6.6} $$
                Define the upper bound

$$ M = \max_{i \in I} \left\{ (Tv)_i - v_i \right\}. $$
Since $M \ge (Tv)_i - v_i$ for all $i \in I$, we obtain from (6.6.6) that


$$ c_i(R_i(v)) - M + \sum_{j \in I} p_{ij}(R_i(v))\, v_j \le v_i, \qquad i \in I. $$
Hence, by Theorem 6.2.1, $g_i(R(v)) \le M$ for all $i \in I$, proving the last inequality
in (6.6.3). This completes the proof.

We now formulate the value-iteration algorithm. In the formulation it is no
restriction to assume that

$$ c_i(a) > 0 \quad \text{for all } i \in I \text{ and } a \in A(i). $$

Otherwise, add a sufficiently large positive constant to each $c_i(a)$. This increases the
average cost of each policy by that same constant, so the comparison between policies is unchanged.
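
As a quick check of this remark (a sketch; the notation $g_i(R;c)$ for the average cost
of policy $R$ under cost function $c$, and the conditional-expectation form of its
definition, are used here only for this verification), adding a constant $\gamma > 0$
to every one-step cost shifts each policy's average cost by exactly $\gamma$:

\[
g_i(R;\, c + \gamma)
   = \lim_{n \to \infty} \frac{1}{n}\,
     E\!\left[ \sum_{t=1}^{n} \bigl( c_{X_t}(R_{X_t}) + \gamma \bigr)
     \,\middle|\, X_1 = i \right]
   = g_i(R;\, c) + \gamma,
\]

so the ordering of the policies, and hence any minimizing policy, is unaffected by the shift.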


                Value-iteration algorithm
Step 0 (initialization). Choose $V_0(i)$, $i \in I$, with $0 \le V_0(i) \le \min_a c_i(a)$. Let
$n := 1$.
Step 1 (value-iteration step). For each state $i \in I$, compute

$$ V_n(i) = \min_{a \in A(i)} \left\{ c_i(a) + \sum_{j \in I} p_{ij}(a)\, V_{n-1}(j) \right\}. $$
Let $R(n)$ be any stationary policy such that, for each state $i$, the action $a = R_i(n)$
minimizes the right-hand side of the equation for $V_n(i)$.
Step 2 (bounds on the minimal costs). Compute the bounds

$$ m_n = \min_{i \in I} \left\{ V_n(i) - V_{n-1}(i) \right\}, \qquad
   M_n = \max_{i \in I} \left\{ V_n(i) - V_{n-1}(i) \right\}. $$
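
The steps formulated so far translate directly into code. The following is a minimal
Python sketch of Steps 0-2; the function name value_iteration, the dense cost and
transition arrays, and the relative-accuracy stopping test
$0 \le M_n - m_n \le \epsilon\, m_n$ are assumptions, since the algorithm's remaining
steps fall outside this page.

    import numpy as np

    def value_iteration(c, p, eps=1e-4, max_iter=10_000):
        """Average-cost value iteration: a sketch of Steps 0-2.

        c[i][a]    : one-step cost c_i(a), assumed > 0
        p[i][a][j] : transition probability p_ij(a)
        Returns a minimizing policy R(n) and the bounds (m_n, M_n).
        """
        N = len(c)
        V = np.zeros(N)  # Step 0: satisfies 0 <= V_0(i) <= min_a c_i(a)
        for n in range(1, max_iter + 1):
            V_new = np.empty(N)
            R = np.empty(N, dtype=int)
            for i in range(N):  # Step 1: value-iteration step
                q = [c[i][a] + np.dot(p[i][a], V) for a in range(len(c[i]))]
                R[i] = int(np.argmin(q))
                V_new[i] = q[R[i]]
            diff = V_new - V  # Step 2: bounds on the minimal costs
            m_n, M_n = diff.min(), diff.max()
            V = V_new
            # Assumed stopping test (not shown on this page): stop as soon
            # as 0 <= M_n - m_n <= eps * m_n; M_n >= m_n always holds.
            if M_n - m_n <= eps * m_n:
                break
        return R, m_n, M_n

Since each $c_i(a) > 0$ and $V_0 = 0$, the first difference $m_1 = \min_i V_1(i) > 0$, so the
relative test is well defined. By the bounds proved above, $m_n$ and $M_n$ sandwich the
minimal average cost at every iteration, so $M_n - m_n$ measures how far the policy
$R(n)$ can be from average-cost optimal.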