
256             DISCRETE-TIME MARKOV DECISION PROCESSES

                problem involving a randomly changing state, a constraint may be placed on the
                frequency at which a certain inoperative state occurs.
                  The following illustrative example taken from Wagner (1975) shows that for
                control problems with probabilistic constraints it may be optimal to choose the
                decisions in a random way rather than in a deterministic way. Suppose the daily
                demand D for some product is described by the probability distribution

                              P{D = 0} = P{D = 1} = 1/6,   P{D = 2} = 2/3.
                The demands on the successive days are independent of each other. At the beginning
                of each day it has to be decided how much to order of the product. The delivery
                of any order is instantaneous. The variable ordering cost of each unit is c > 0.
                Any unit that is not sold at the end of the day becomes obsolete and must be
                discarded. The decision problem is to minimize the average ordering cost per day,
                subject to the constraint that the fraction of the demand to be met is at least 1/3. This
                probabilistic constraint is satisfied when using the policy of ordering one unit every
                day, a policy which has an average cost of c per day. However, this deterministic
                control rule is not optimal, as can be seen by considering the randomized control
                rule under which on any given day no unit is ordered with probability 4/5 and two
                units are ordered with probability 1/5. Under this randomized rule the probability
                that the daily demand is met equals (4/5)(1/6) + (1/5)(1) = 1/3 and the average ordering
                cost per day equals (4/5)(0) + (1/5)(2c) = 2c/5. It is readily seen that the randomized
                rule is optimal.
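The two policies in this example can be checked with exact rational arithmetic. The sketch below (a verification aid, not part of the original text) enumerates, for a given distribution over order quantities, the probability that the day's demand is fully met and the expected ordering cost; the function name `policy_stats` is our own.

```python
from fractions import Fraction as F

# Daily demand distribution from the example: P{D = d}
demand = {0: F(1, 6), 1: F(1, 6), 2: F(2, 3)}

def policy_stats(order_probs, c=F(1)):
    """order_probs: dict mapping order quantity q to the probability of
    ordering q units. Returns (probability the daily demand is fully met,
    average ordering cost per day, in units of c)."""
    # Demand is fully met exactly when the quantity ordered is at least D.
    p_met = sum(pq * sum(pd for d, pd in demand.items() if d <= q)
                for q, pq in order_probs.items())
    cost = sum(pq * q * c for q, pq in order_probs.items())
    return p_met, cost

# Deterministic rule: order one unit every day -> meets demand w.p. 1/3, costs c.
print(policy_stats({1: F(1)}))
# Randomized rule: order nothing w.p. 4/5, two units w.p. 1/5
# -> also meets demand w.p. 1/3, but costs only 2c/5.
print(policy_stats({0: F(4, 5), 2: F(1, 5)}))
```

Both rules satisfy the constraint with equality, yet the randomized rule costs 2c/5 per day instead of c, which is why determinism is not optimal here.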
                  So far we have considered only stationary policies under which the actions
                are chosen deterministically. A policy π is called a stationary randomized policy
                when it is described by a probability distribution {π_a(i), a ∈ A(i)} for each state
                i ∈ I. Under policy π action a ∈ A(i) is chosen with probability π_a(i) whenever
                the process is in state i. If π_a(i) is 0 or 1 for every i and a, the stationary
                randomized policy π reduces to the familiar stationary policy choosing the actions
                in a deterministic way. For any policy π, let the state-action frequencies f_{ia}(π)
                be defined by

                    f_{ia}(π) = the long-run fraction of decision epochs at which the process
                                is in state i and action a is chosen when policy π is used.
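For a stationary randomized policy on a unichain model, the state-action frequencies factor as f_{ia}(π) = p_i(π) π_a(i), where p(π) is the stationary distribution of the Markov chain induced by π. The sketch below illustrates this on a hypothetical two-state, two-action example of our own invention (the transition probabilities and policy are not from the text).

```python
# p_trans[i][a][j] = transition probability from state i to state j
# when action a is chosen in state i (hypothetical numbers).
p_trans = {
    0: {0: [0.9, 0.1], 1: [0.5, 0.5]},
    1: {0: [0.2, 0.8], 1: [0.6, 0.4]},
}
# pi[i][a] = probability that the randomized policy chooses action a in state i.
pi = {0: {0: 0.7, 1: 0.3}, 1: {0: 1.0, 1: 0.0}}

states = [0, 1]

# Transition matrix of the induced chain: P_pi(i, j) = sum_a pi_a(i) p_ij(a).
P = [[sum(pi[i][a] * p_trans[i][a][j] for a in pi[i]) for j in states]
     for i in states]

# Stationary distribution by power iteration (the induced chain here is
# unichain and aperiodic, so the iteration converges).
p = [0.5, 0.5]
for _ in range(1000):
    p = [sum(p[i] * P[i][j] for i in states) for j in states]

# State-action frequencies: f_{ia}(pi) = p_i * pi_a(i).
f = {(i, a): p[i] * pi[i][a] for i in states for a in pi[i]}
print(f)
```

The frequencies f_{ia} sum to 1 over all state-action pairs, which is what makes them usable as decision variables in the linear-programming formulation that follows.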
                Consider now a Markovian decision problem in which the goal is to minimize the
                long-run average cost per time unit subject to the following linear constraints on
                the state-action frequencies:

                        Σ_{i∈I} Σ_{a∈A(i)} α_{ia}^{(s)} f_{ia}(π) ≤ β^{(s)},   s = 1, . . . , L,

                where α_{ia}^{(s)} and β^{(s)} are given constants. It is assumed that the constraints allow
                for a feasible solution. If the unichain assumption from Section 6.4 holds, it can
                be shown that an optimal policy may be obtained by solving the following linear