
Step 2 (policy-improvement step). For each state i ∈ I, determine an action a_i yielding the minimum in

$$\min_{a \in A(i)} \left\{ c_i(a) - g(R)\,\tau_i(a) + \sum_{j \in I} p_{ij}(a)\,\upsilon_j(R) \right\}.$$

The new stationary policy R̄ is obtained by choosing R̄_i = a_i for all i ∈ I with the convention that R̄_i is chosen equal to the old action R_i when this action minimizes the policy-improvement quantity.

Step 3 (convergence test). If the new policy R̄ = R, then the algorithm is stopped with policy R. Otherwise, go to step 1 with R replaced by R̄.
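To make the two steps concrete, here is a minimal sketch in Python of the policy-improvement step and the convergence test, assuming the model data are held in hypothetical dictionaries cost[i][a], tau[i][a] and prob[i][a][j], and that g_R and v_R contain the average cost g(R) and relative values υ_i(R) computed in step 1 for the current policy R:

```python
def improve_policy(states, actions, cost, tau, prob, R, g_R, v_R):
    """Step 2 (policy improvement) and step 3 (convergence test).

    cost[i][a], tau[i][a], prob[i][a][j] : one-step costs, expected
    transition times and transition probabilities of the semi-Markov
    decision model (hypothetical data layout); g_R, v_R : average cost
    and relative values of the current stationary policy R from step 1.
    """
    def test_quantity(i, a):
        # policy-improvement quantity c_i(a) - g(R) tau_i(a) + sum_j p_ij(a) v_j(R)
        return (cost[i][a] - g_R * tau[i][a]
                + sum(prob[i][a][j] * v_R[j] for j in states))

    R_new = {}
    for i in states:
        best_a = min(actions[i], key=lambda a: test_quantity(i, a))
        # convention: keep the old action R_i when it also attains the minimum
        # (a small tolerance guards against round-off errors)
        if test_quantity(i, R[i]) <= test_quantity(i, best_a) + 1e-12:
            best_a = R[i]
        R_new[i] = best_a

    converged = all(R_new[i] == R[i] for i in states)  # step 3: is R_new equal to R?
    return R_new, converged
```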

                  In the same way as for the discrete-time Markov decision model, it can be
                shown that the algorithm converges in a finite number of iterations to an average
                cost optimal policy. Also, as a consequence of the convergence of the algorithm,
there exist numbers g* and υ_i* satisfying the average cost optimality equation

$$\upsilon_i^* = \min_{a \in A(i)} \left\{ c_i(a) - g^*\,\tau_i(a) + \sum_{j \in I} p_{ij}(a)\,\upsilon_j^* \right\}, \qquad i \in I. \tag{7.2.2}$$

The constant g* is uniquely determined as the minimal average cost per time unit.
                Moreover, each stationary policy whose actions minimize the right-hand side of
                (7.2.2) for all i ∈ I is average cost optimal. The proof of these statements is left
                as an exercise for the reader.
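As a small illustration of these statements, the sketch below checks numerically whether a candidate pair (g*, υ*) satisfies (7.2.2) and, if so, reads off a stationary policy whose actions minimize the right-hand side; the data layout is the same hypothetical one as above:

```python
def check_optimality_equation(states, actions, cost, tau, prob,
                              g_star, v_star, tol=1e-8):
    """Verify (7.2.2) for a candidate (g*, v*) and extract a minimizing policy."""
    policy = {}
    for i in states:
        rhs = {a: cost[i][a] - g_star * tau[i][a]
                  + sum(prob[i][a][j] * v_star[j] for j in states)
               for a in actions[i]}
        best_a = min(rhs, key=rhs.get)
        # (7.2.2) requires v_i* to equal the minimum over a in A(i)
        if abs(rhs[best_a] - v_star[i]) > tol:
            return None                     # equation violated in state i
        policy[i] = best_a                  # any minimizing action will do
    return policy
```

If the function returns a policy, that policy is average cost optimal by the statement above.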
                Value-iteration algorithm
                For the semi-Markov decision model the formulation of a value-iteration algorithm
                is not straightforward. A recursion relation for the minimal expected costs over the
                first n decision epochs does not take into account the non-identical transition times
                and thus these costs cannot be related to the minimal average cost per time unit.
                However, by the data transformation method from Section 7.1, we can convert the
                semi-Markov decision model into a discrete-time Markov decision model such that
                both models have the same average cost for each stationary policy. A value-iteration
                algorithm for the original semi-Markov decision model is then implied by the value-
                iteration algorithm for the transformed discrete-time Markov decision model. In the
discrete-time model it is no restriction to assume that the transformed one-step costs c_i(a)/τ_i(a) are
positive; otherwise, add a sufficiently large positive constant to each c_i(a). The
                following recursion method results for the semi-Markov decision model:

Step 0. Choose V_0(i) such that 0 ≤ V_0(i) ≤ min_a {c_i(a)/τ_i(a)} for all i. Choose a number τ with 0 < τ ≤ min_{i,a} τ_i(a). Let n := 1.
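A minimal sketch of this initialization, under the same hypothetical data layout as before (only step 0 is shown; the recursion itself is that of the value-iteration algorithm for the transformed discrete-time model):

```python
def initialize_value_iteration(states, actions, cost, tau):
    """Step 0 of the value-iteration algorithm (names are assumptions).

    Chooses V_0 with 0 <= V_0(i) <= min_a c_i(a)/tau_i(a) for all i and a
    number tau0 with 0 < tau0 <= min_{i,a} tau_i(a), and sets n := 1.
    """
    V0 = {i: 0.5 * min(cost[i][a] / tau[i][a] for a in actions[i])
          for i in states}                               # any value in the allowed interval
    tau0 = min(tau[i][a] for i in states for a in actions[i])
    return V0, tau0, 1                                   # n := 1
```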