An application of Theorem 6.2.1 now gives that
$$g_i(R) \ge m, \qquad i \in I.$$
This inequality holds for each policy $R$ and so $g^* = \min_R g_i(R) \ge m$, proving the
first inequality in (6.6.3). The proof of the last inequality in (6.6.3) is very similar.
By the definition of policy $R(v)$,
$$(Tv)_i = c_i(R_i(v)) + \sum_{j \in I} p_{ij}(R_i(v))\, v_j, \qquad i \in I. \tag{6.6.6}$$
Define the upper bound
$$M = \max_{i \in I} \{(Tv)_i - v_i\}.$$
Since $M \ge (Tv)_i - v_i$ for all $i \in I$, we obtain from (6.6.6) that
$$c_i(R_i(v)) - M + \sum_{j \in I} p_{ij}(R_i(v))\, v_j \le v_i, \qquad i \in I.$$
Hence, by Theorem 6.2.1, $g_i(R(v)) \le M$ for all $i \in I$, proving the last inequality
in (6.6.3). This completes the proof.
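To make the bounds in this argument concrete, the following is a minimal Python sketch, not taken from the book, for a small hypothetical two-state Markov decision model with invented costs $c_i(a)$, transition probabilities $p_{ij}(a)$ and test vector $v$: it applies the operator $T$, records the minimizing actions $R_i(v)$, and evaluates $m = \min_i\{(Tv)_i - v_i\}$ and $M = \max_i\{(Tv)_i - v_i\}$.

```python
# Minimal illustration with invented data: for a two-state model and a given
# vector v, compute (Tv)_i, the minimizing actions R_i(v), and the bounds
# m = min_i {(Tv)_i - v_i} and M = max_i {(Tv)_i - v_i}.

# costs[i][a] = c_i(a), probs[i][a][j] = p_ij(a)   (hypothetical example data)
costs = {0: {0: 1.0, 1: 4.0}, 1: {0: 3.0, 1: 2.0}}
probs = {0: {0: [0.5, 0.5], 1: [0.9, 0.1]},
         1: {0: [0.2, 0.8], 1: [0.6, 0.4]}}
states = [0, 1]
v = [0.0, 1.5]                       # an arbitrary test vector v

Tv = []
policy = []                          # the actions R_i(v) attaining the minimum
for i in states:
    best_val, best_a = min(
        (costs[i][a] + sum(probs[i][a][j] * v[j] for j in states), a)
        for a in costs[i]
    )
    Tv.append(best_val)
    policy.append(best_a)

m = min(Tv[i] - v[i] for i in states)   # lower bound on the minimal average cost
M = max(Tv[i] - v[i] for i in states)   # upper bound on g_i(R(v))
print(f"R(v) = {policy}, m = {m:.3f}, M = {M:.3f}")
```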
We now formulate the value-iteration algorithm. In the formulation it is no
restriction to assume that
$$c_i(a) > 0 \quad \text{for all } i \in I \text{ and } a \in A(i).$$
Otherwise, add a sufficiently large positive constant to each $c_i(a)$. This affects the
average cost of each policy by the same constant.
Value-iteration algorithm
Step 0 (initialization). Choose $V_0(i)$, $i \in I$, with $0 \le V_0(i) \le \min_a c_i(a)$. Let
$n := 1$.
Step 1 (value-iteration step). For each state $i \in I$, compute
$$V_n(i) = \min_{a \in A(i)} \left\{ c_i(a) + \sum_{j \in I} p_{ij}(a) V_{n-1}(j) \right\}.$$
Let $R(n)$ be any stationary policy such that the action $a = R_i(n)$ minimizes the
right-hand side of the equation for $V_n(i)$ for each state $i$.
Step 2 (bounds on the minimal costs). Compute the bounds
$$m_n = \min_{i \in I} \{V_n(i) - V_{n-1}(i)\}, \qquad M_n = \max_{i \in I} \{V_n(i) - V_{n-1}(i)\}.$$
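The sketch below strings Steps 0 to 2 together for the same kind of hypothetical two-state data used above; it is an illustrative implementation under assumed data, not the book's code. The loop repeats the value-iteration step and recomputes the bounds at each pass; the stopping test $M_n - m_n \le \varepsilon\, m_n$ used to end the loop is an assumption made for this sketch.

```python
# Sketch of Steps 0-2 with hypothetical two-state data: repeatedly apply the
# value-iteration step and compute the bounds m_n and M_n on the minimal cost.
# The stopping test M_n - m_n <= eps * m_n is an assumption for this sketch.

costs = {0: {0: 1.0, 1: 4.0}, 1: {0: 3.0, 1: 2.0}}
probs = {0: {0: [0.5, 0.5], 1: [0.9, 0.1]},
         1: {0: [0.2, 0.8], 1: [0.6, 0.4]}}
states = [0, 1]
eps = 1e-6

# Step 0: choose V_0 with 0 <= V_0(i) <= min_a c_i(a).
V = [0.0 for _ in states]

n = 0
while True:
    n += 1
    # Step 1: V_n(i) = min_a { c_i(a) + sum_j p_ij(a) V_{n-1}(j) },
    # and record a minimizing action R_i(n) for each state.
    V_new, policy = [], []
    for i in states:
        best_val, best_a = min(
            (costs[i][a] + sum(probs[i][a][j] * V[j] for j in states), a)
            for a in costs[i]
        )
        V_new.append(best_val)
        policy.append(best_a)

    # Step 2: bounds on the minimal costs.
    m_n = min(V_new[i] - V[i] for i in states)
    M_n = max(V_new[i] - V[i] for i in states)
    V = V_new

    if M_n - m_n <= eps * m_n:       # assumed tolerance test for this sketch
        break

print(f"n = {n}, policy R(n) = {policy}, bounds [{m_n:.6f}, {M_n:.6f}]")
```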