The Markov decision model has many potential applications in inventory control,
maintenance, manufacturing and telecommunication, among others. Perhaps this
versatile model will see many more significant applications as it becomes more
familiar to engineers, operations research analysts, computer scientists and
others. To that end, Chapters 6 and 7 focus on the algorithmic aspects of Markov
decision theory and illustrate the wide applicability of the Markov decision model
to a variety of realistic problems. The presentation is confined to the optimality
criterion of the long-run average cost (reward) per time unit. For many applications
of Markov decision theory this criterion is the most appropriate optimality criterion.
The average cost criterion is particularly appropriate when many state transitions
occur in a relatively short time, as is typically the case for stochastic control
problems in computer systems and telecommunication networks. Other criteria are
the expected total cost and the expected total discounted cost. These criteria are
discussed at length in Puterman (1994) and will not be addressed in this book.
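To fix ideas, the following is a sketch of the average cost criterion in notation that is introduced formally in Section 6.1 (states $i$, actions $a$, one-step costs $c_i(a)$, and a stationary policy $R$ prescribing action $R_i$ whenever the system is in state $i$). Under policy $R$ the long-run average cost per time unit starting from state $i$ is

\[
g_i(R) = \lim_{n\to\infty} \frac{1}{n}\, E\!\left[\,\sum_{k=0}^{n-1} c_{X_k}(R_{X_k}) \;\Big|\; X_0 = i\right],
\]

whereas the total discounted cost criterion would instead weight the cost incurred at the $k$th decision epoch by a discount factor $\alpha^k$ with $0<\alpha<1$.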
This chapter deals with the discrete-time Markov decision model in which deci-
sions can be made only at fixed equidistant points in time. The semi-Markov
decision model, in which the times between the decision epochs are random, will
be the subject of the next chapter. In Section 6.1 we present the basic elements
of the discrete-time Markov decision model. A policy-improvement procedure is
discussed in Section 6.2. This procedure is the key to various algorithms for com-
puting an average cost optimal policy. The so-called relative values of a given
policy play an important role in the improvement procedure. The relative values
and their interpretation are the subject of Section 6.3. In Section 6.4 we present
the policy-iteration algorithm which generates a sequence of improved policies.
Section 6.5 discusses the linear programming formulation for the Markov decision
model, including a formulation to handle probabilistic constraints on the state-action
frequencies. The policy-iteration algorithm and the linear programming formulation
both require solving a system of linear equations in each iteration step. In
Section 6.6 we discuss the alternative method of value iteration, which avoids the
computationally burdensome solution of systems of linear equations and involves
only recursive computations. The value-iteration algorithm endowed with quickly
converging lower and upper bounds on the minimal average cost is usually the
most effective method for solving Markov decision problems with a large number
of states. Section 6.7 gives convergence proofs for the policy-iteration algorithm
and the value-iteration algorithm.
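As a preview of the value-iteration method of Section 6.6, and again using notation that is defined only later (one-step costs $c_i(a)$, transition probabilities $p_{ij}(a)$ and action sets $A(i)$), the recursion and its bounds can be sketched as follows. Starting from an arbitrary function $V_0(i)$, one computes recursively

\[
V_n(i) = \min_{a\in A(i)}\left\{ c_i(a) + \sum_{j} p_{ij}(a)\, V_{n-1}(j) \right\},
\]

and the quantities $m_n = \min_j \{V_n(j)-V_{n-1}(j)\}$ and $M_n = \max_j \{V_n(j)-V_{n-1}(j)\}$ provide, under appropriate conditions on the underlying Markov chains, lower and upper bounds on the minimal average cost, so that the iterations can be stopped as soon as $M_n - m_n$ is sufficiently small.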
6.1 THE MODEL
In Chapter 3 we considered a dynamic system that evolves over time accord-
ing to a fixed probabilistic law of motion satisfying the Markovian assumption.
This assumption states that the next state to be visited depends only on the present
state of the system. In this chapter we deal with a dynamic system evolving over
time where the probabilistic law of motion can be controlled by taking decisions.
Also, costs are incurred (or rewards are earned) as a consequence of the decisions