
                An application of Theorem 6.2.1 now gives that

$$ g_i(R) \ge m, \qquad i \in I. $$

This inequality holds for each policy $R$, and so $g^* = \min_R g_i(R) \ge m$, proving the
first inequality in (6.6.3). The proof of the last inequality in (6.6.3) is very similar.
By the definition of the policy $R(v)$,


$$ (Tv)_i = c_i(R_i(v)) + \sum_{j \in I} p_{ij}(R_i(v))\, v_j, \qquad i \in I. \tag{6.6.6} $$
                Define the upper bound

$$ M = \max_{i \in I} \left\{ (Tv)_i - v_i \right\}. $$
Since $M \ge (Tv)_i - v_i$ for all $i \in I$, we obtain from (6.6.6) that


$$ c_i(R_i(v)) - M + \sum_{j \in I} p_{ij}(R_i(v))\, v_j \le v_i, \qquad i \in I. $$
Hence, by Theorem 6.2.1, $g_i(R(v)) \le M$ for all $i \in I$, proving the last inequality
in (6.6.3). This completes the proof.

We now formulate the value-iteration algorithm. In the formulation it is no
restriction to assume that

$$ c_i(a) > 0 \quad \text{for all } i \in I \text{ and } a \in A(i). $$

Otherwise, add a sufficiently large positive constant to each $c_i(a)$. This increases the
average cost of each policy by that same constant, so the comparison between policies is unchanged.
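
As a quick check of this remark (a sketch; the notation $g_i(R;c)$ for the average cost
of policy $R$ under cost function $c$, and the conditional-expectation form of its
definition, are used here only for this verification), adding a constant $\gamma > 0$
to every one-step cost shifts each policy's average cost by exactly $\gamma$:

\[
g_i(R;\, c + \gamma)
   = \lim_{n \to \infty} \frac{1}{n}\,
     E\!\left[ \sum_{t=1}^{n} \bigl( c_{X_t}(R_{X_t}) + \gamma \bigr)
     \,\middle|\, X_1 = i \right]
   = g_i(R;\, c) + \gamma,
\]

so the ordering of the policies, and hence any minimizing policy, is unaffected by the shift.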


                Value-iteration algorithm
Step 0 (initialization). Choose $V_0(i)$, $i \in I$, with $0 \le V_0(i) \le \min_a c_i(a)$. Let
$n := 1$.
Step 1 (value-iteration step). For each state $i \in I$, compute

$$ V_n(i) = \min_{a \in A(i)} \left\{ c_i(a) + \sum_{j \in I} p_{ij}(a)\, V_{n-1}(j) \right\}. $$
Let $R(n)$ be any stationary policy such that, for each state $i$, the action $a = R_i(n)$
minimizes the right-hand side of the equation for $V_n(i)$.
Step 2 (bounds on the minimal costs). Compute the bounds

$$ m_n = \min_{i \in I} \left\{ V_n(i) - V_{n-1}(i) \right\}, \qquad
   M_n = \max_{i \in I} \left\{ V_n(i) - V_{n-1}(i) \right\}. $$
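
The steps formulated so far translate directly into code. The following is a minimal
Python sketch of Steps 0-2; the function name value_iteration, the dense cost and
transition arrays, and the relative-accuracy stopping test
$0 \le M_n - m_n \le \epsilon\, m_n$ are assumptions, since the algorithm's remaining
steps fall outside this page.

    import numpy as np

    def value_iteration(c, p, eps=1e-4, max_iter=10_000):
        """Average-cost value iteration: a sketch of Steps 0-2.

        c[i][a]    : one-step cost c_i(a), assumed > 0
        p[i][a][j] : transition probability p_ij(a)
        Returns a minimizing policy R(n) and the bounds (m_n, M_n).
        """
        N = len(c)
        V = np.zeros(N)  # Step 0: satisfies 0 <= V_0(i) <= min_a c_i(a)
        for n in range(1, max_iter + 1):
            V_new = np.empty(N)
            R = np.empty(N, dtype=int)
            for i in range(N):  # Step 1: value-iteration step
                q = [c[i][a] + np.dot(p[i][a], V) for a in range(len(c[i]))]
                R[i] = int(np.argmin(q))
                V_new[i] = q[R[i]]
            diff = V_new - V  # Step 2: bounds on the minimal costs
            m_n, M_n = diff.min(), diff.max()
            V = V_new
            # Assumed stopping test (not shown on this page): stop as soon
            # as 0 <= M_n - m_n <= eps * m_n; M_n >= m_n always holds.
            if M_n - m_n <= eps * m_n:
                break
        return R, m_n, M_n

Since each $c_i(a) > 0$ and $V_0 = 0$, the first difference $m_1 = \min_i V_1(i) > 0$, so the
relative test is well defined. By the bounds proved above, $m_n$ and $M_n$ sandwich the
minimal average cost at every iteration, so $M_n - m_n$ measures how far the policy
$R(n)$ can be from average-cost optimal.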