

Markov decision model are defined by
$$\overline{I} = I, \qquad \overline{A}(i) = A(i), \quad i \in \overline{I},$$
$$\overline{c}_i(a) = c_i(a), \qquad a \in \overline{A}(i) \text{ and } i \in \overline{I},$$
$$\overline{p}_{ij}(a) =
\begin{cases}
\tau p_{ij}(a), & j \neq i,\ a \in \overline{A}(i) \text{ and } i \in \overline{I},\\
\tau p_{ij}(a) + 1 - \tau, & j = i,\ a \in \overline{A}(i) \text{ and } i \in \overline{I}.
\end{cases}$$
For each stationary policy, the associated Markov chain $\{\overline{X}_n\}$ in the perturbed model
is aperiodic. It is not difficult to verify that for each stationary policy the average
cost per time unit in the perturbed model is the same as that in the original model.
For the unichain case this is an immediate consequence of the representation (6.2.7)
for the average cost and the fact that for each stationary policy the Markov chain
$\{\overline{X}_n\}$ has the same equilibrium probabilities as the Markov chain $\{X_n\}$ in the original
model. For the multichain case, a similar argument can be used to show that the
two models are in fact equivalent. Thus the value-iteration algorithm can be applied
                to the perturbed model in order to solve the original model. In specific problems
                involving periodicities, the ‘optimal’ value of τ is usually not clear beforehand;
empirical investigations indicate that $\tau = \frac{1}{2}$ is usually a satisfactory choice.
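To make the transformation concrete, the following sketch (an illustration only, not from the text; the function names, the two-state example and the use of NumPy are assumptions) applies $\overline{p}_{ij} = \tau p_{ij} + (1-\tau)$ for $j = i$ to the transition matrix of a fixed stationary policy and checks numerically that the equilibrium probabilities, and hence the average cost per time unit, are unchanged.

```python
import numpy as np

def perturb(P, tau=0.5):
    """Aperiodicity transformation: keep probability (1 - tau) of staying
    put; spread the remaining mass tau according to the original law."""
    return tau * P + (1.0 - tau) * np.eye(P.shape[0])

def equilibrium(P):
    """Equilibrium distribution pi solving pi P = pi with sum(pi) = 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

# A periodic two-state chain (period 2) under some stationary policy,
# together with the one-step costs incurred under that policy.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
c = np.array([2.0, 5.0])

pi_orig = equilibrium(P)
pi_pert = equilibrium(perturb(P, tau=0.5))

# Same equilibrium probabilities, hence the same average cost sum_i pi_i c_i.
print(pi_orig, pi_pert)          # both [0.5, 0.5]
print(pi_orig @ c, pi_pert @ c)  # both 3.5
```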
                Modified value iteration with a dynamic relaxation factor

                Value iteration does not have the fast convergence of policy iteration. The number
                of iterations required by the value-iteration algorithm is problem dependent and
                increases when the number of problem states gets larger. Also, the tolerance number
                ε in the stopping criterion affects the number of iterations required. The stopping
criterion should be based on the lower and upper bounds $m_n$ and $M_n$ but not on
any repetitive behaviour of the generated policies $R(n)$.
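As a small illustration of such a bounds-based test, the sketch below computes $m_n = \min_i\{V_n(i) - V_{n-1}(i)\}$ and $M_n = \max_i\{V_n(i) - V_{n-1}(i)\}$ from one value-iteration sweep. The stopping step of the algorithm is not reproduced on this page, so the relative inequality used here, like the function name, is an assumption.

```python
import numpy as np

def bounds_and_stop(V_new, V_old, eps=1e-6):
    """Lower and upper bounds m_n, M_n from one value-iteration sweep;
    the convergence test uses these bounds, not the generated policies."""
    diff = np.asarray(V_new) - np.asarray(V_old)
    m_n, M_n = diff.min(), diff.max()
    # Relative stopping test on the bounds (assumed form).
    stop = 0 <= M_n - m_n <= eps * m_n
    return m_n, M_n, stop
```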
                  The convergence rate of value iteration can often be accelerated by using a
                relaxation factor, such as in successive overrelaxation for solving a single system
of linear equations. Then at the $n$th iteration a new approximation to the value
function $V_n(i)$ is obtained by using both the previous values $V_{n-1}(i)$ and the
residuals $V_n(i) - V_{n-1}(i)$. It is possible to select dynamically a relaxation factor and
                thus avoid the experimental determination of the best value of a fixed relaxation
                factor. The following modification of the standard value-iteration algorithm can
                be formulated. Steps 0, 1, 2 and 3 are as before, while step 4 of the standard
                value-iteration algorithm is modified as follows.
Step 4(a). Determine the states $u$ and $v$ such that
$$V_n(u) - V_{n-1}(u) = m_n \quad\text{and}\quad V_n(v) - V_{n-1}(v) = M_n$$
and compute the relaxation factor
$$\omega = \frac{M_n - m_n}{M_n - m_n + \sum_{j \in I} \{p_{uj}(R_u) - p_{vj}(R_v)\}\{V_n(j) - V_{n-1}(j)\}},$$
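The remainder of step 4, in which $\omega$ is used to form the new approximation, continues beyond this excerpt. The sketch below is an illustration only: the array names and the representation of the policy $R(n)$ are assumptions, and the final relaxed update $V_{n-1}(i) + \omega\{V_n(i) - V_{n-1}(i)\}$ is taken from the description in the preceding paragraph rather than from the algorithm's step 4(b).

```python
import numpy as np

def dynamic_relaxation_step(V_new, V_old, P_policy):
    """Step 4(a) sketch: find the states u and v attaining the bounds
    m_n and M_n, compute the dynamic relaxation factor omega, and apply
    the relaxed update V_{n-1}(i) + omega * (V_n(i) - V_{n-1}(i)).

    V_new    : V_n produced by the standard value-iteration step (1-d array)
    V_old    : V_{n-1} (1-d array)
    P_policy : assumed layout -- row i holds the transition probabilities
               p_{ij}(R_i) for the action R_i chosen in state i by R(n)
    """
    diff = V_new - V_old
    u = int(np.argmin(diff))   # V_n(u) - V_{n-1}(u) = m_n
    v = int(np.argmax(diff))   # V_n(v) - V_{n-1}(v) = M_n
    m_n, M_n = diff[u], diff[v]

    denom = (M_n - m_n) + (P_policy[u] - P_policy[v]) @ diff
    omega = (M_n - m_n) / denom if denom != 0 else 1.0

    # Relaxed update of the value function (assumed form of step 4(b)).
    return V_old + omega * diff, omega
```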