may be quite complicated in the sense that the prescribed actions may depend on
the whole history of the system. An important class of policies is the subclass of
stationary policies. A stationary policy $R$ is a policy that assigns to each state $i$ a
fixed action $a = R_i$ and always uses this action whenever the system is in state $i$.
For example, in the maintenance problem with $N = 5$, the policy $R$ prescribing
a preventive repair only in the states 3 and 4 is given by $R_1 = 0$, $R_2 = 0$,
$R_3 = R_4 = 1$ and $R_5 = R_6 = 2$.
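In computational terms a stationary policy is nothing more than a lookup table from states to actions. A minimal Python sketch of the policy above; the meanings attached to the action labels 0, 1 and 2 are assumptions recalled from the maintenance problem's formulation, not part of this passage:

```python
# Stationary policy R for the maintenance example with N = 5.
# R[i] is the fixed action R_i used whenever the system is in state i;
# the action labels (0 = do nothing, 1 = preventive repair, 2 = repair)
# are assumed from the maintenance problem's formulation.
R = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}

assert R[3] == R[4] == 1  # preventive repair only in states 3 and 4
```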
For $n = 0, 1, \ldots$, define

$$X_n = \text{the state of the system at the $n$th decision epoch}.$$
Under a given stationary policy $R$, we have

$$P\{X_{n+1} = j \mid X_n = i\} = p_{ij}(R_i),$$

regardless of the past history of the system up to time $n$. Hence under a given
stationary policy $R$ the stochastic process $\{X_n\}$ is a discrete-time Markov chain with
one-step transition probabilities $p_{ij}(R_i)$. This Markov chain incurs a cost $c_i(R_i)$
each time the system visits state $i$. Thus we can invoke results from Markov chain
theory to specify the long-run average cost per time unit under a given stationary
policy.
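This observation can be made computational. For a unichain policy $R$, the long-run average cost per time unit equals $\sum_i \pi_i \, c_i(R_i)$, where $\{\pi_i\}$ is the equilibrium distribution of the Markov chain with transition probabilities $p_{ij}(R_i)$. The following Python sketch solves the equilibrium equations numerically; the nested-dict data structures `p[i][a][j]` and `c[i][a]` are illustrative choices of our own, not notation from the text:

```python
import numpy as np

def average_cost(states, p, c, R):
    """Long-run average cost per time unit of stationary policy R.

    states : list of states, assumed to form a unichain under R
    p      : p[i][a] is a dict with p[i][a][j] = p_ij(a)
    c      : c[i][a] = c_i(a), one-step cost of action a in state i
    R      : R[i]    = R_i, the action prescribed in state i
    """
    n = len(states)
    idx = {s: k for k, s in enumerate(states)}
    # Transition matrix of the Markov chain induced by R.
    P = np.zeros((n, n))
    for i in states:
        for j in states:
            P[idx[i], idx[j]] = p[i][R[i]].get(j, 0.0)
    # Equilibrium equations pi = pi P together with sum_i pi_i = 1.
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    # Long-run average cost: sum_i pi_i * c_i(R_i).
    return sum(pi[idx[i]] * c[i][R[i]] for i in states)
```

Since the equilibrium equations have a unique solution under the unichain assumption, the least-squares solve recovers $\{\pi_i\}$ exactly up to round-off.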
In view of the Markov assumption made and the fact that the planning horizon
is infinitely long, it will be intuitively clear that it is sufficient to consider only the
class of stationary policies. However, other policies are conceivable: policies whose
actions depend on the past states or policies whose actions are determined by a
random mechanism. This issue raises a fundamental question in Markov decision
theory: does there exist an optimal policy among the class of all conceivable policies
and, if an optimal policy exists, is such a policy a stationary policy? The answer
to these questions is yes for the average-cost Markov decision model with a finite
state space and finite action sets. However, a mathematical proof requires rather
deep arguments. The interested reader is referred to Derman (1970) and Puterman
(1994) for a proof. From these books the reader will learn that the issue of the
existence of an optimal (stationary) policy is a very subtle one. Especially for the
average cost criterion, the optimality questions become very complicated when
the state space is not finite but countably infinite. Even in simple countable-state
models, average cost optimal policies need not exist and, when they do, they need
not be stationary; see Puterman (1994). In the average-cost Markov decision model
with a finite state space and finite action sets these difficulties do not arise and the
analysis can be restricted to the class of stationary policies.
6.2 THE POLICY-IMPROVEMENT IDEA
In this section we will establish a key result that underlies the various algorithms
for the computation of an average cost optimal policy. Before doing this, we discuss
the long-run average cost per time unit for a stationary policy.