
DISCRETE-TIME MARKOV DECISION PROCESSES

                  The Markov decision model has many potential applications in inventory control,
                maintenance, manufacturing and telecommunication among others. Perhaps this
                versatile model will see many more significant applications when it becomes more
                familiar to engineers, operations research analysts, computer scientists and
                others. To that end, Chapters 6 and 7 focus on the algorithmic aspects of Markov
                decision theory and illustrate the wide applicability of the Markov decision model
                to a variety of realistic problems. The presentation is confined to the optimality
                criterion of the long-run average cost (reward) per time unit. For many applications
                of Markov decision theory this criterion is the most appropriate optimality criterion.
                The average cost criterion is particularly appropriate when many state transitions
                occur in a relatively short time, as is typically the case for stochastic control
                problems in computer systems and telecommunication networks. Other criteria are
                the expected total cost and the expected total discounted cost. These criteria are
                discussed at length in Puterman (1994) and will not be addressed in this book.
                  This chapter deals with the discrete-time Markov decision model in which deci-
                sions can be made only at fixed equidistant points in time. The semi-Markov
                decision model in which the times between the decision epochs are random will
                be the subject of the next chapter. In Section 6.1 we present the basic elements
                of the discrete-time Markov decision model. A policy-improvement procedure is
                discussed in Section 6.2. This procedure is the key to various algorithms for com-
                puting an average cost optimal policy. The so-called relative values of a given
                policy play an important role in the improvement procedure. The relative values
                and their interpretation are the subject of Section 6.3. In Section 6.4 we present
                the policy-iteration algorithm which generates a sequence of improved policies.
                Section 6.5 discusses the linear programming formulation for the Markov decision
                model, including a formulation to handle probabilistic constraints on the state-action
                frequencies. The policy-iteration algorithm and the linear programming formulation
                both require the solving of a system of linear equations in each iteration step. In
                Section 6.6 we discuss the alternative method of value iteration which avoids the
                computationally burdensome solving of systems of linear equations but involves
                only recursive computations. The value-iteration algorithm endowed with quickly
                converging lower and upper bounds on the minimal average cost is usually the
                most effective method for solving Markov decision problems with a large number
                of states. Section 6.7 gives convergence proofs for the policy-iteration algorithm
                and the value-iteration algorithm.
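
                  As a preview of the bounds mentioned above (the notation is anticipatory and is
                developed properly only in Section 6.6, so the formulas below should be read as an
                illustrative sketch rather than as definitions), the value-iteration scheme computes
                for each state $i$ the recursion
                $$ V_n(i) = \min_{a \in A(i)} \Big\{ c_i(a) + \sum_{j} p_{ij}(a)\, V_{n-1}(j) \Big\}, $$
                and the differences of successive iterates supply the bounds
                $$ m_n = \min_j \{ V_n(j) - V_{n-1}(j) \} \;\le\; g^* \;\le\; \max_j \{ V_n(j) - V_{n-1}(j) \} = M_n $$
                on the minimal average cost $g^*$, where $M_n - m_n \to 0$ under an appropriate
                aperiodicity condition.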


                                        6.1  THE MODEL
                In Chapter 3 we considered a dynamic system that evolves over time accord-
                ing to a fixed probabilistic law of motion satisfying the Markovian assumption.
                This assumption states that the next state to be visited depends only on the present
                state of the system. In this chapter we deal with a dynamic system evolving over
                time where the probabilistic law of motion can be controlled by taking decisions.
                Also, costs are incurred (or rewards are earned) as a consequence of the decisions taken.
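                  To fix ideas, and anticipating the formal definitions given in the remainder of
                this section (the notation below follows the standard set-up of discrete-time Markov
                decision models and should be taken as an illustrative sketch rather than as this
                book's definitions), such a model is specified by a state space $I$, a finite set
                $A(i)$ of actions for each state $i \in I$, one-step costs $c_i(a)$ and transition
                probabilities $p_{ij}(a)$ for $a \in A(i)$. For a stationary policy $R$ that always
                prescribes action $R_i$ in state $i$, the long-run average cost per time unit when
                starting from state $i$ can be written as
                $$ g_i(R) = \lim_{n \to \infty} \frac{1}{n}\, E\Big[ \sum_{k=0}^{n-1} c_{X_k}\big(R_{X_k}\big) \;\Big|\; X_0 = i \Big], $$
                where $X_k$ denotes the state of the system at the $k$th decision epoch.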