
                A Lagrange-multiplier approach for probabilistic constraints
A heuristic approach for handling probabilistic constraints is the Lagrange-multiplier
method. This method produces only stationary non-randomized policies. To
describe the method, assume a single probabilistic constraint


$$\sum_{i \in I} \sum_{a \in A(i)} \alpha_{ia}\, f_{ia}(\pi) \le \beta$$
on the state-action frequencies. In the Lagrange-multiplier method, the constraint
is eliminated by putting it into the criterion function by means of a Lagrange
multiplier λ ≥ 0. That is, the goal function is changed from $\sum_{i,a} c_i(a)x_{ia}$ to
$\sum_{i,a} c_i(a)x_{ia} + \lambda\bigl(\sum_{i,a} \alpha_{ia} x_{ia} - \beta\bigr)$. The Lagrange multiplier may be interpreted as
the cost attached to each unit of the resource that is used. The original Markov decision
                problem without probabilistic constraint is obtained by taking λ = 0. It is assumed
                that the probabilistic constraint is not satisfied for the optimal stationary policy in
                the unconstrained problem; otherwise, this policy is optimal for the constrained
                problem as well. Thus, for a given value of the Lagrange multiplier λ > 0, we
                consider the unconstrained Markov decision problem with one-step costs

$$c_i^{\lambda}(a) = c_i(a) + \lambda \alpha_{ia}$$
and one-step transition probabilities $p_{ij}(a)$ as before. Solving this unconstrained
Markov decision problem yields an optimal deterministic policy R(λ) that always
prescribes a fixed action $R_i(\lambda)$ whenever the system is in state i. Let β(λ) be
the constraint level associated with policy R(λ), that is,

$$\beta(\lambda) = \sum_{i \in I} \alpha_{i, R_i(\lambda)}\, f_{i, R_i(\lambda)}(R(\lambda)).$$
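As a concrete illustration of this formula, the following sketch (not from the book) evaluates β(λ) for a given stationary deterministic policy by computing the stationary distribution of the Markov chain that the policy induces; for a deterministic policy the state-action frequency $f_{i,R_i(\lambda)}(R(\lambda))$ is simply the stationary probability of state i. The array layout (P[i, a, j] = $p_{ij}(a)$, alpha[i, a] = $\alpha_{ia}$, policy[i] = $R_i(\lambda)$) is a hypothetical choice made for the example, and the chain is assumed unichain; the value-iteration route mentioned below is an alternative.

import numpy as np

def beta_of_policy(P, alpha, policy):
    """Evaluate beta(lambda) for a fixed stationary deterministic policy R(lambda).

    Hypothetical layout: P[i, a, j] = p_ij(a), alpha[i, a] = alpha_{ia},
    policy[i] = action R_i(lambda) chosen in state i (unichain assumed).
    """
    n = P.shape[0]
    # Transition matrix of the Markov chain induced by the policy.
    P_R = P[np.arange(n), policy, :]
    # Stationary distribution: solve pi (P_R - I) = 0 together with sum(pi) = 1.
    A = np.vstack([P_R.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    # For a deterministic policy, f_{i, R_i(lambda)} is the stationary probability of state i.
    return float(pi @ alpha[np.arange(n), policy])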
If β(λ) > β, one should increase λ; otherwise one should decrease λ. Why? The
Lagrange multiplier λ should be adjusted until the smallest value of λ is found
for which β(λ) ≤ β. Bisection is a convenient method for adjusting λ. How do we
                calculate β(λ) for a given value of λ? To do so, observe that β(λ) can be interpreted
                as the average cost in a single Markov chain with an appropriate cost structure.
                Consider the Markov chain describing the state of the system under policy R(λ).
                In this Markov process, the long-run average cost per time unit equals β(λ) when
it is assumed that a direct cost of $\alpha_{i, R_i(\lambda)}$ is incurred each time the process visits
                state i. An effective method to compute the average cost β(λ) is to apply value
                iteration to a single Markov chain; see Example 6.6.1 in the next section.
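To make the adjustment of λ concrete, here is a minimal bisection sketch (an illustration under stated assumptions, not the book's code). It assumes a user-supplied routine beta_of(lam) that solves the λ-modified unconstrained Markov decision problem (for instance by policy iteration, or by the value-iteration scheme of Example 6.6.1) and returns the constraint level β(λ) of the policy it finds, e.g. using beta_of_policy above; the starting bracket and tolerance are arbitrary illustrative choices.

def smallest_feasible_lambda(beta_of, beta, lam_hi=1.0, tol=1e-6):
    """Bisection for (approximately) the smallest lambda >= 0 with beta_of(lambda) <= beta.

    beta_of(lam) should solve the unconstrained MDP with one-step costs
    c_i(a) + lam * alpha_{ia} and return beta(lam) for the optimal policy it finds.
    """
    # Grow the upper end of the bracket until the constraint is satisfied.
    while beta_of(lam_hi) > beta:
        lam_hi *= 2.0
    lam_lo = 0.0
    while lam_hi - lam_lo > tol:
        lam = 0.5 * (lam_lo + lam_hi)
        if beta_of(lam) > beta:
            lam_lo = lam      # beta(lambda) still too large: increase lambda
        else:
            lam_hi = lam      # constraint met: try a smaller lambda
    return lam_hi

The sketch relies on β(λ) being non-increasing in λ, which is what the adjustment rule above exploits; since β(λ) is piecewise constant in λ, the returned multiplier is the approximate breakpoint at which the constraint first becomes satisfied.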
                  The average cost of the stationary policy obtained by the Lagrangian approach
                will in general be larger than the average cost of the stationary randomized policy
                resulting from the linear programming formulation. Also, it should be pointed out
                that there is no guarantee that the policy obtained by the Lagrangian approach is
                the best policy among all stationary policies satisfying the probabilistic constraint,
                although in most practical situations this may be expected to be the case. In spite