
and so, by Theorem 7.1.1, ḡ(R) = g(R). Thus we can conclude that an average
cost optimal policy in the semi-Markov model can be obtained by solving an
appropriate discrete-time Markov decision model. This conclusion is particularly
useful with respect to the value-iteration algorithm. In applying value iteration to
the transformed model, it is no restriction to assume that for each stationary policy
the associated Markov chain {X_n} is aperiodic. By choosing the constant τ strictly
less than min_{i,a} τ_i(a), we always have p̄_ii(a) > 0 for all i and a, and thus the
required aperiodicity.
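To make this concrete, the following minimal sketch (Python; the array names p, tau_times, costs and tau are hypothetical) builds the data of the transformed discrete-time model, assuming the standard transformation of Section 7.1 in which p̄_ij(a) = (τ/τ_i(a)) p_ij(a) with the extra mass 1 − τ/τ_i(a) placed on j = i, and c̄_i(a) = c_i(a)/τ_i(a). Choosing τ below min_{i,a} τ_i(a) then makes every diagonal entry strictly positive.

```python
import numpy as np

def transform_data(p, tau_times, costs, tau):
    """Sketch of the data transformation (hypothetical array layout).

    p[i, a, j]      : semi-Markov transition probabilities p_ij(a)
    tau_times[i, a] : expected sojourn times tau_i(a)
    costs[i, a]     : expected one-stage costs c_i(a)
    tau             : constant with 0 < tau < min over (i, a) of tau_i(a)
    """
    n_states, n_actions, _ = p.shape
    c_bar = costs / tau_times                  # c_bar_i(a) = c_i(a) / tau_i(a)
    p_bar = np.empty_like(p)
    for i in range(n_states):
        for a in range(n_actions):
            ratio = tau / tau_times[i, a]      # strictly below 1 by the choice of tau
            p_bar[i, a] = ratio * p[i, a]
            p_bar[i, a, i] += 1.0 - ratio      # extra mass on state i, so p_bar_ii(a) > 0
    return p_bar, c_bar
```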



                        7.2  ALGORITHMS FOR AN OPTIMAL POLICY
                In this section we outline how the algorithms for the discrete-time Markov decision
                model can be extended to the semi-Markov decision model.


                Policy-iteration algorithm
The policy-iteration algorithm will be described under the unichain assumption.
This assumption requires that for each stationary policy the embedded Markov
chain {X_n} has no two disjoint closed sets. By the data transformation, it is directly
verified that the value-determination equations (6.3.2) for a given stationary policy R
remain valid provided that we replace g by g τ_i(R_i). The policy-improvement
procedure from Theorem 6.2.1 also remains valid when we replace g by g τ_i(R_i).
Suppose that g(R) and v_i(R), i ∈ I, are the average cost and the relative values
of a stationary policy R. If a stationary policy R̄ is constructed such that, for each
state i ∈ I,

    c_i(R̄_i) − g(R) τ_i(R̄_i) + Σ_{j∈I} p_ij(R̄_i) v_j(R) ≤ v_i(R),        (7.2.1)

then g(R̄) ≤ g(R). Moreover, g(R̄) < g(R) if the strict inequality sign holds in
(7.2.1) for some state i which is recurrent under R̄. These statements can be verified
by the same arguments as used in the second part of the proof of Theorem 6.2.1.
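As an illustration of how (7.2.1) can be used to construct an improved policy, the sketch below (Python; the array layout and the action sets are hypothetical, and numpy arrays are assumed) selects in every state an action whose test quantity does not exceed the current relative value, taking a strictly better action whenever one exists.

```python
def improve_policy(R, g, v, costs, tau_times, p, actions, eps=1e-12):
    """Construct a policy R_bar satisfying the test (7.2.1) in every state.

    R          : current policy, R[i] is the action chosen in state i
    g, v       : average cost g(R) and relative values v_i(R)
    costs      : costs[i, a] = c_i(a)
    tau_times  : tau_times[i, a] = tau_i(a)
    p          : p[i, a, j] = p_ij(a)
    actions    : actions[i] = admissible actions in state i
    """
    R_bar = list(R)
    for i in range(len(R_bar)):
        # Keeping R_i always satisfies (7.2.1), since its test quantity equals v_i(R).
        best_action, best_value = R_bar[i], v[i]
        for a in actions[i]:
            test = costs[i, a] - g * tau_times[i, a] + p[i, a] @ v
            if test < best_value - eps:        # strict inequality may lower the average cost
                best_action, best_value = a, test
        R_bar[i] = best_action
    return R_bar
```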
                  Under the unichain assumption, we can now formulate the following policy-
                iteration algorithm:
                Step 0 (initialization). Choose a stationary policy R.
Step 1 (value-determination step). For the current rule R, compute the average cost
g(R) and the relative values v_i(R), i ∈ I, as the unique solution to the linear
equations

    v_i = c_i(R_i) − g τ_i(R_i) + Σ_{j∈I} p_ij(R_i) v_j,        i ∈ I,
    v_s = 0,

where s is an arbitrarily chosen state.
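Step 1 amounts to solving a system of |I| + 1 linear equations in the unknowns g and v_i, i ∈ I. A minimal sketch (Python/numpy; the array layout is again hypothetical) is:

```python
import numpy as np

def value_determination(R, costs, tau_times, p, s=0):
    """Solve the value-determination equations of Step 1.

    Unknowns: v_i for i in I and g, with the normalization v_s = 0.
    R[i] is the action of the current rule in state i; array layout as before.
    """
    n = len(R)
    A = np.zeros((n + 1, n + 1))     # unknown vector x = (v_0, ..., v_{n-1}, g)
    b = np.zeros(n + 1)
    for i in range(n):
        a = R[i]
        A[i, :n] = -p[i, a]          # - sum_j p_ij(R_i) v_j
        A[i, i] += 1.0               # + v_i
        A[i, n] = tau_times[i, a]    # + g tau_i(R_i)
        b[i] = costs[i, a]           # = c_i(R_i)
    A[n, s] = 1.0                    # normalization v_s = 0
    x = np.linalg.solve(A, b)        # nonsingular under the unichain assumption
    return x[n], x[:n]               # g(R) and the relative values v_i(R)
```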