
Step 2 (policy-improvement step). For each state i ∈ I, determine an action a_i yielding the minimum in

\[
\min_{a \in A(i)} \Big\{ c_i(a) - g(R) + \sum_{j \in I} p_{ij}(a)\, v_j(R) \Big\}.
\]

The new stationary policy R̄ is obtained by choosing R̄_i = a_i for all i ∈ I with the convention that R̄_i is chosen equal to the old action R_i when this action minimizes the policy-improvement quantity.
Step 3 (convergence test). If the new policy R̄ = R, then the algorithm is stopped with policy R. Otherwise, go to step 1 with R replaced by R̄.

  The policy-iteration algorithm converges after a finite number of iterations to an average cost optimal policy. We defer the proof to Section 6.7. Empirically, the policy-iteration algorithm is found to be remarkably robust and to converge very fast in specific problems. The number of iterations is practically independent of the number of states and typically varies between 3 and 15, say. Also, it can roughly be stated that the average costs of the policies generated by policy iteration converge at least exponentially fast to the minimum average cost, with the greatest improvements in the first few iterations.
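
  To make the three steps concrete, the following is a minimal sketch in Python of the full policy-iteration loop. It assumes a finite state space, a common action set in every state, and unichain policies (so the value-determination equations of step 1 have a unique solution once one relative value is normalized to zero). The names policy_iteration, costs and probs are illustrative and not from the text.

```python
import numpy as np

def policy_iteration(costs, probs, ref_state=0, max_iters=100):
    """Policy iteration for the long-run average cost criterion (sketch).

    costs[i][a]    : one-step cost c_i(a)
    probs[i][a][j] : transition probability p_ij(a)
    Every state is assumed to share the same action set {0, ..., A-1}.
    """
    n = len(costs)                       # number of states
    n_actions = len(costs[0])
    R = [0] * n                          # arbitrary initial stationary policy

    for _ in range(max_iters):
        # Step 1 (value determination): solve the linear equations
        #   v_i = c_i(R_i) - g + sum_j p_ij(R_i) v_j,  i in I,
        # together with the normalization v_{ref_state} = 0.
        A = np.zeros((n + 1, n + 1))
        b = np.zeros(n + 1)
        for i in range(n):
            for j in range(n):
                A[i, j] = (1.0 if i == j else 0.0) - probs[i][R[i]][j]
            A[i, n] = 1.0                # coefficient of g
            b[i] = costs[i][R[i]]
        A[n, ref_state] = 1.0            # normalization v_{ref_state} = 0
        sol = np.linalg.solve(A, b)
        v, g = sol[:n], sol[n]

        # Step 2 (policy improvement): in each state pick an action attaining
        #   min_a { c_i(a) - g + sum_j p_ij(a) v_j },
        # keeping the old action whenever it already attains the minimum.
        R_new = list(R)
        for i in range(n):
            test = [costs[i][a] - g + sum(probs[i][a][j] * v[j] for j in range(n))
                    for a in range(n_actions)]
            if test[R[i]] > min(test) + 1e-10:
                R_new[i] = int(np.argmin(test))

        # Step 3 (convergence test): stop when the policy did not change.
        if R_new == R:
            return R, g, v
        R = R_new

    return R, g, v
```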

Remark 6.4.1 The average cost optimality equation
Since the policy-iteration algorithm converges after finitely many iterations, there exist numbers g* and v_i*, i ∈ I, such that

\[
v_i^* = \min_{a \in A(i)} \Big\{ c_i(a) - g^* + \sum_{j \in I} p_{ij}(a)\, v_j^* \Big\}, \qquad i \in I. \tag{6.4.3}
\]
This functional equation is called the average cost optimality equation. Using Theorem 6.2.1, we can directly verify that any stationary policy R* for which the action R_i* minimizes the right-hand side of (6.4.3) for all i ∈ I is average cost optimal. To see this, note that

\[
v_i^* = c_i(R_i^*) - g^* + \sum_{j \in I} p_{ij}(R_i^*)\, v_j^*, \qquad i \in I \tag{6.4.4}
\]
                and

\[
v_i^* \le c_i(a) - g^* + \sum_{j \in I} p_{ij}(a)\, v_j^*, \qquad a \in A(i) \text{ and } i \in I. \tag{6.4.5}
\]
The equality (6.4.4) and Theorem 6.2.1 imply that g(R*) = g*. Let R be any stationary policy. Taking a = R_i in (6.4.5) for all i ∈ I and applying Theorem 6.2.1, we find g(R) ≥ g*. In other words, g(R*) ≤ g(R) for any stationary policy R. This shows not only that policy R* is average cost optimal but also that the constant g* in (6.4.3) is uniquely determined as the minimal average cost per time unit.
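
  As an illustration of the remark, the following check applies the policy_iteration sketch given earlier to a hypothetical two-state, two-action example (the numbers are made up purely for illustration): after convergence, g* and the relative values satisfy the optimality equation (6.4.3), and no stationary policy attains a smaller average cost than g*.

```python
import itertools
import numpy as np

# Hypothetical two-state, two-action example (numbers made up for illustration).
costs = [[1.0, 4.0],                  # c_0(0), c_0(1)
         [3.0, 0.5]]                  # c_1(0), c_1(1)
probs = [[[0.8, 0.2], [0.3, 0.7]],    # p_0j(0), p_0j(1)
         [[0.5, 0.5], [0.1, 0.9]]]    # p_1j(0), p_1j(1)

R_star, g_star, v_star = policy_iteration(costs, probs)

# The fixed point satisfies the optimality equation (6.4.3).
for i in range(2):
    rhs = min(costs[i][a] - g_star + sum(probs[i][a][j] * v_star[j] for j in range(2))
              for a in range(2))
    assert abs(v_star[i] - rhs) < 1e-8

# g* is minimal: no stationary policy R has a smaller average cost g(R).
for R in itertools.product(range(2), repeat=2):
    P = np.array([probs[i][R[i]] for i in range(2)])
    # stationary distribution of the Markov chain induced by R: pi P = pi, sum(pi) = 1
    M = np.vstack([P.T - np.eye(2), np.ones((1, 2))])
    pi = np.linalg.lstsq(M, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
    g_R = sum(pi[i] * costs[i][R[i]] for i in range(2))
    assert g_R >= g_star - 1e-8

print("g* =", g_star, "optimal actions =", R_star)
```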