
where $R_u$ and $R_v$ are the actions which are prescribed by policy $R(n)$ in the states $u$ and $v$.

Step 4(b). For each $i \in I$, change $V_n(i)$ according to
$$V_n(i) := V_{n-1}(i) + \omega\{V_n(i) - V_{n-1}(i)\}.$$
Step 4(c). $n := n + 1$ and go to step 1.
In the case of a tie when selecting in step 4(a) the state $u$ for which the minimum in $m_n$ is obtained, it is conventional to choose the minimizing state of the previous iteration when that state is one of the candidates; otherwise, choose the first state achieving the minimum in $m_n$. The same convention is used for the maximizing state $v$ in $M_n$.
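As a concrete illustration of steps 4(a) to 4(c), the following Python fragment sketches the selection of $u$ and $v$ under this tie-breaking convention together with the relaxation update of step 4(b). It is only a minimal sketch: the function names and the NumPy representation of the value functions are illustrative choices, and the relaxation factor ω is passed in as an argument because its computation belongs to step 4(a).

import numpy as np

def select_with_ties(diff, kind, previous=None):
    # Return the state attaining the minimum (kind="min") or maximum
    # (kind="max") of diff.  In case of a tie, prefer the state chosen at
    # the previous iteration when it is among the candidates; otherwise
    # take the first candidate, as in the convention described above.
    target = diff.min() if kind == "min" else diff.max()
    candidates = np.flatnonzero(np.isclose(diff, target))
    if previous is not None and previous in candidates:
        return int(previous)
    return int(candidates[0])

def relaxation_update(V_prev, V_new, omega, prev_u=None, prev_v=None):
    # Step 4(b): V_n(i) := V_{n-1}(i) + omega * (V_n(i) - V_{n-1}(i)) for all i.
    # V_prev and V_new are NumPy arrays indexed by state.
    diff = V_new - V_prev                      # V_n(i) - V_{n-1}(i)
    u = select_with_ties(diff, "min", prev_u)  # state attaining the lower bound m_n
    v = select_with_ties(diff, "max", prev_v)  # state attaining the upper bound M_n
    m_n, M_n = diff[u], diff[v]                # bounds of step 2, taken before relaxing
    V_relaxed = V_prev + omega * diff          # values carried into iteration n + 1
    return V_relaxed, m_n, M_n, u, v

Note that the bounds $m_n$ and $M_n$ are computed from the unrelaxed differences, in line with the remark made later in this section that ω is kept outside the recursion equation of step 1.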
The choice of the dynamic relaxation factor ω is motivated as follows. We change the estimate $V_n(i)$ as $V_n(i) = V_{n-1}(i) + \omega\{V_n(i) - V_{n-1}(i)\}$ for all $i$ in order to accomplish at the $(n+1)$th iteration that
$$c_u(R_u) + \sum_{j\in I} p_{uj}(R_u)\,V_n(j) - V_n(u) = c_v(R_v) + \sum_{j\in I} p_{vj}(R_v)\,V_n(j) - V_n(v),$$
in the implicit hope that the difference between the new upper and lower bounds $M_{n+1}$ and $m_{n+1}$ will decrease more quickly. Using the relation $m_n = V_n(u) - V_{n-1}(u) = c_u(R_u) + \sum_{j} p_{uj}(R_u)\,V_{n-1}(j) - V_{n-1}(u)$ and the similar relation for $M_n$, it is a matter of simple algebra to verify from the above condition the expression for $\omega$. We omit the easy proof that $\omega > 0$. Numerical experiments indicate that using a dynamic relaxation factor in value iteration often greatly enhances the speed of convergence of the algorithm. The modified value-iteration algorithm is theoretically not guaranteed to converge, but in practice the algorithm will usually work very well. It is important to note that the relaxation factor $\omega$ is kept outside the recursion equation in step 1 so that the bounds $m_n$ and $M_n$ in step 2 are not destroyed. Although the bounds still apply, it is no longer true that the sequences $\{m_n\}$ and $\{M_n\}$ are monotonic.
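The simple algebra referred to above can be sketched as follows, writing $d_n(j) = V_n(j) - V_{n-1}(j)$ for the differences before relaxation. Substituting the relaxed values $V_{n-1}(j) + \omega\,d_n(j)$ into the displayed condition and using $m_n = d_n(u)$ and $M_n = d_n(v)$ together with the relations for $m_n$ and $M_n$ quoted above, the condition becomes
$$
m_n + \omega\Bigl\{\sum_{j\in I} p_{uj}(R_u)\,d_n(j) - m_n\Bigr\}
   = M_n + \omega\Bigl\{\sum_{j\in I} p_{vj}(R_v)\,d_n(j) - M_n\Bigr\},
$$
and solving this linear equation for $\omega$ gives
$$
\omega = \frac{M_n - m_n}{M_n - m_n + \sum_{j\in I}\bigl\{p_{uj}(R_u) - p_{vj}(R_v)\bigr\}\,d_n(j)},
$$
which should agree with the expression for $\omega$ specified in step 4(a).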
                  To conclude this section, we apply value iteration to two examples. The first
                example concerns the maintenance problem from Example 6.1.1 and the second
example illustrates the usefulness of value iteration for the computation of performance measures for a single Markov chain.

                Example 6.1.1 (continued) A maintenance problem
                For the maintenance problem the recursion equation (6.6.1) becomes

$$
V_n(1) = 0 + \sum_{j=1}^{N} q_{1j}\,V_{n-1}(j),
$$
$$
V_n(i) = \min\Biggl\{\, 0 + \sum_{j=i}^{N} q_{ij}\,V_{n-1}(j),\; C_{pi} + V_{n-1}(1) \Biggr\}, \qquad 1 < i < N,
$$