
                may be quite complicated in the sense that the prescribed actions may depend on
                the whole history of the system. An important class of policies is the subclass of
                stationary policies. A stationary policy R is a policy that assigns to each state i a
fixed action $a = R_i$ and always uses this action whenever the system is in state $i$.
For example, in the maintenance problem with $N = 5$, the policy $R$ prescribing
a preventive repair only in the states 3 and 4 is given by $R_1 = 0$, $R_2 = 0$,
$R_3 = R_4 = 1$ and $R_5 = R_6 = 2$.
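In code, a stationary policy is nothing more than a fixed lookup table from states to actions. The minimal Python sketch below encodes the policy $R$ of the maintenance example; the action labels 0, 1, 2 are those of the example (with 1 the preventive repair action), and the helper name `action` is purely illustrative.

```python
# Stationary policy R for the maintenance example: a fixed action R_i for each state i.
# Action labels follow the text (1 = preventive repair, prescribed only in states 3 and 4).
R = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}

def action(state):
    """Action prescribed by the stationary policy R in the given state."""
    return R[state]
```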
For $n = 0, 1, \ldots$, define
$$X_n = \text{the state of the system at the } n\text{th decision epoch}.$$
Under a given stationary policy $R$, we have
$$P\{X_{n+1} = j \mid X_n = i\} = p_{ij}(R_i),$$
regardless of the past history of the system up to time $n$. Hence under a given
stationary policy $R$ the stochastic process $\{X_n\}$ is a discrete-time Markov chain with
one-step transition probabilities $p_{ij}(R_i)$. This Markov chain incurs a cost $c_i(R_i)$
each time the system visits state $i$. Thus we can invoke results from Markov chain
theory to specify the long-run average cost per time unit under a given stationary
policy.
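To make this concrete, the sketch below computes the long-run average cost of a fixed stationary policy by the standard Markov chain recipe: form the transition matrix $P(R)$ with entries $p_{ij}(R_i)$, solve for its equilibrium distribution $\pi(R)$, and average the one-step costs $c_i(R_i)$ against it. The transition probabilities and costs used here are illustrative placeholders, not the data of the maintenance example; only the structure of the computation follows the text.

```python
import numpy as np

def average_cost(P, c):
    """Long-run average cost per time unit of the Markov chain with
    one-step transition matrix P and one-step costs c (unichain case).

    Solves pi P = pi together with sum(pi) = 1 and returns (g, pi),
    where g = sum_i pi_i * c_i.
    """
    n = P.shape[0]
    # Stack the balance equations pi (P - I) = 0 with the normalization sum(pi) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(pi @ c), pi

# Illustrative 3-state example (placeholder numbers, not from the book):
# under some stationary policy R the induced chain has matrix P(R) and
# cost vector (c_i(R_i)).
P_R = np.array([[0.6, 0.3, 0.1],
                [0.2, 0.5, 0.3],
                [0.0, 0.4, 0.6]])
c_R = np.array([1.0, 3.0, 8.0])

g, pi = average_cost(P_R, c_R)
print("equilibrium distribution:", pi)
print("long-run average cost g(R):", g)
```

The same routine applies to any stationary policy once its transition matrix and cost vector have been filled in from the data $p_{ij}(a)$ and $c_i(a)$ of the model at hand.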
                  In view of the Markov assumption made and the fact that the planning horizon
                is infinitely long, it will be intuitively clear that it is sufficient to consider only the
                class of stationary policies. However, other policies are conceivable: policies whose
                actions depend on the past states or policies whose actions are determined by a
                random mechanism. This issue raises a fundamental question in Markov decision
                theory: does there exist an optimal policy among the class of all conceivable policies
                and, if an optimal policy exists, is such a policy a stationary policy? The answer
                to these questions is yes for the average-cost Markov decision model with a finite
                state space and finite action sets. However, a mathematical proof requires rather
                deep arguments. The interested reader is referred to Derman (1970) and Puterman
                (1994) for a proof. From these books the reader will learn that the issue of the
                existence of an optimal (stationary) policy is a very subtle one. Especially for the
                average cost criterion, the optimality questions become very complicated when
                the state space is not finite but countably infinite. Even in simple countable-state
                models, average cost optimal policies need not exist and, when they do, they need
                not be stationary; see Puterman (1994). In the average-cost Markov decision model
                with a finite state space and finite action sets these difficulties do not arise and the
                analysis can be restricted to the class of stationary policies.


                            6.2  THE POLICY-IMPROVEMENT IDEA

                In this section we will establish a key result that underlies the various algorithms
                for the computation of an average cost optimal policy. Before doing this, we discuss
                the long-run average cost per time unit for a stationary policy.