
                            6.4   POLICY-ITERATION ALGORITHM

For ease of presentation we will discuss the policy-iteration algorithm under a unichain assumption that is satisfied in most applications.

Unichain assumption  For each stationary policy the associated Markov chain {X_n} has no two disjoint closed sets.

The relative values associated with a given policy R provide a tool for constructing a new policy R̄ whose average cost is no more than that of the current policy R. In order to improve a given policy R whose average cost g(R) and relative values v_i(R), i ∈ I, have been computed, we apply Theorem 6.2.1 with g = g(R) and v_i = v_i(R), i ∈ I. By constructing a new policy R̄ such that, for each state i ∈ I,

$$c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, v_j(R) \le v_i(R), \tag{6.4.1}$$

we obtain an improved rule R̄ in the sense that g(R̄) ≤ g(R). In constructing such an improved policy R̄ it is important to realize that an action R̄_i satisfying (6.4.1) can be determined for each state i separately. As a side remark, we point out that this flexibility of the policy-improvement procedure may be exploited in specific applications to generate a sequence of improved policies within a subclass of policies having a simple structure. A particular way to find for state i ∈ I an action R̄_i satisfying (6.4.1) is to minimize


$$c_i(a) - g(R) + \sum_{j \in I} p_{ij}(a)\, v_j(R) \tag{6.4.2}$$
with respect to a ∈ A(i). Noting that the expression in (6.4.2) equals v_i(R) for a = R_i, it follows that (6.4.1) is satisfied by any action R̄_i that minimizes (6.4.2) over a ∈ A(i). We are now in a position to formulate the following algorithm.
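Before stating it, here is a minimal sketch of this improvement step in Python. The data layout is an assumption made for illustration: c[i][a] holds the one-step cost c_i(a), p[i][a] holds the vector of transition probabilities (p_ij(a))_j, and the name improve_policy is hypothetical rather than from the text.

```python
import numpy as np

def improve_policy(R, g, v, c, p, actions):
    """Policy-improvement step: in each state i, pick an action
    minimizing the test quantity (6.4.2)."""
    R_bar = list(R)
    for i, A_i in enumerate(actions):
        # c_i(a) - g(R) + sum_j p_ij(a) v_j(R), one value per action a in A(i)
        test = [c[i][a] - g + np.dot(p[i][a], v) for a in A_i]
        R_bar[i] = A_i[int(np.argmin(test))]
    return R_bar
```

Since the test quantity equals v_i(R) at a = R_i, the minimizing action always satisfies (6.4.1), so the returned policy can never be worse than R.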


Policy-iteration algorithm
Step 0 (initialization). Choose a stationary policy R.
Step 1 (value-determination step). For the current rule R, compute the unique solution {g(R), v_i(R)} to the following system of linear equations:

$$v_i = c_i(R_i) - g + \sum_{j \in I} p_{ij}(R_i)\, v_j, \qquad i \in I,$$
$$v_s = 0,$$

where s is an arbitrarily chosen state.
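For illustration, this value-determination step can be carried out by solving the linear system directly. Below is a minimal sketch in Python under the same assumed data layout as above; the function name value_determination is hypothetical. Under the unichain assumption the (N+1) × (N+1) system is nonsingular, so a direct solve applies.

```python
import numpy as np

def value_determination(R, c, p, s=0):
    """Value-determination step: solve the equations of Step 1 for
    g(R) and the relative values v_i(R), with the normalization v_s = 0."""
    N = len(R)
    c_R = np.array([c[i][R[i]] for i in range(N)])  # costs c_i(R_i)
    P_R = np.array([p[i][R[i]] for i in range(N)])  # matrix of p_ij(R_i)
    # Unknowns x = (v_0, ..., v_{N-1}, g); the balance equations read
    #   v_i - sum_j p_ij(R_i) v_j + g = c_i(R_i),  i in I,
    # and the last row enforces v_s = 0.
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = np.eye(N) - P_R
    A[:N, N] = 1.0       # coefficient of g in each balance equation
    A[N, s] = 1.0        # normalization row: v_s = 0
    b = np.append(c_R, 0.0)
    x = np.linalg.solve(A, b)
    return x[N], x[:N]   # g(R) and the vector (v_i(R))
```

Alternating this step with the improvement step sketched above, until the policy no longer changes, yields the policy-iteration algorithm.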