
256             DISCRETE-TIME MARKOV DECISION PROCESSES

                problem involving a randomly changing state, a constraint may be placed on the
                frequency at which a certain inoperative state occurs.
                  The following illustrative example taken from Wagner (1975) shows that for
                control problems with probabilistic constraints it may be optimal to choose the
                decisions in a random way rather than in a deterministic way. Suppose the daily
                demand D for some product is described by the probability distribution

                              P{D = 0} = P{D = 1} = 1/6,   P{D = 2} = 2/3.
                The demands on the successive days are independent of each other. At the beginning
                of each day it has to be decided how much to order of the product. The delivery
                of any order is instantaneous. The variable ordering cost of each unit is c > 0.
                Any unit that is not sold at the end of the day becomes obsolete and must be
                discarded. The decision problem is to minimize the average ordering cost per day,
                subject to the constraint that the fraction of the demand to be met is at least 1/3. This
                probabilistic constraint is satisfied when using the policy of ordering one unit every
                day, a policy which has an average cost of c per day. However, this deterministic
                control rule is not optimal, as can be seen by considering the randomized control
                rule under which on any given day no unit is ordered with probability 4/5 and two
                units are ordered with probability 1/5. Under this randomized rule the probability
                that the daily demand is met equals (4/5)(1/6) + (1/5)(1) = 1/3 and the average ordering
                cost per day equals (4/5)(0) + (1/5)(2c) = 2c/5. It is readily seen that the randomized
                rule is optimal.
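The two policies in this example can be checked with exact rational arithmetic. The sketch below (a verification aid, not part of the original text) enumerates, for a given distribution over order quantities, the probability that the day's demand is fully met and the expected ordering cost; the function name `policy_stats` is our own.

```python
from fractions import Fraction as F

# Daily demand distribution from the example: P{D = d}
demand = {0: F(1, 6), 1: F(1, 6), 2: F(2, 3)}

def policy_stats(order_probs, c=F(1)):
    """order_probs: dict mapping order quantity q to the probability of
    ordering q units. Returns (probability the daily demand is fully met,
    average ordering cost per day, in units of c)."""
    # Demand is fully met exactly when the quantity ordered is at least D.
    p_met = sum(pq * sum(pd for d, pd in demand.items() if d <= q)
                for q, pq in order_probs.items())
    cost = sum(pq * q * c for q, pq in order_probs.items())
    return p_met, cost

# Deterministic rule: order one unit every day -> meets demand w.p. 1/3, costs c.
print(policy_stats({1: F(1)}))
# Randomized rule: order nothing w.p. 4/5, two units w.p. 1/5
# -> also meets demand w.p. 1/3, but costs only 2c/5.
print(policy_stats({0: F(4, 5), 2: F(1, 5)}))
```

Both rules satisfy the constraint with equality, yet the randomized rule costs 2c/5 per day instead of c, which is why determinism is not optimal here.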
                  So far we have considered only stationary policies under which the actions
                are chosen deterministically. A policy π is called a stationary randomized policy
                when it is described by a probability distribution {π_a(i), a ∈ A(i)} for each state
                i ∈ I. Under policy π action a ∈ A(i) is chosen with probability π_a(i) whenever
                the process is in state i. If π_a(i) is 0 or 1 for every i and a, the stationary
                randomized policy π reduces to the familiar stationary policy choosing the actions
                in a deterministic way. For any policy π, let the state-action frequencies f_{ia}(π)
                be defined by

                    f_{ia}(π) = the long-run fraction of decision epochs at which the process
                                is in state i and action a is chosen when policy π is used.
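For a stationary randomized policy on a unichain model, the state-action frequencies factor as f_{ia}(π) = p_i(π) π_a(i), where p(π) is the stationary distribution of the Markov chain induced by π. The sketch below illustrates this on a hypothetical two-state, two-action example of our own invention (the transition probabilities and policy are not from the text).

```python
# p_trans[i][a][j] = transition probability from state i to state j
# when action a is chosen in state i (hypothetical numbers).
p_trans = {
    0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
    1: {0: [0.2, 0.8], 1: [0.6, 0.4]},
}
# pi[i][a] = probability that the randomized policy chooses action a in state i.
pi = {0: {0: 0.7, 1: 0.3}, 1: {0: 1.0, 1: 0.0}}

states = [0, 1]

# Transition matrix of the induced chain: P_pi(i, j) = sum_a pi_a(i) p_ij(a).
P = [[sum(pi[i][a] * p_trans[i][a][j] for a in pi[i]) for j in states]
     for i in states]

# Stationary distribution by power iteration (the induced chain here is
# unichain and aperiodic, so the iteration converges).
p = [0.5, 0.5]
for _ in range(1000):
    p = [sum(p[i] * P[i][j] for i in states) for j in states]

# State-action frequencies: f_{ia}(pi) = p_i * pi_a(i).
f = {(i, a): p[i] * pi[i][a] for i in states for a in pi[i]}
print(f)
```

The frequencies f_{ia} sum to 1 over all state-action pairs, which is what makes them usable as decision variables in the linear-programming formulation that follows.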
                Consider now a Markovian decision problem in which the goal is to minimize the
                long-run average cost per time unit subject to the following linear constraints on
                the state-action frequencies:

                        Σ_{i∈I} Σ_{a∈A(i)} α_{ia}^{(s)} f_{ia}(π) ≤ β^{(s)},   s = 1, . . . , L,

                where α_{ia}^{(s)} and β^{(s)} are given constants. It is assumed that the constraints allow
                for a feasible solution. If the unichain assumption from Section 6.4 holds, it can
                be shown that an optimal policy may be obtained by solving the following linear