A Lagrange-multiplier approach for probabilistic constraints
A heuristic approach for handling probabilistic constraints is the Lagrange-multiplier
method. This method produces only stationary non-randomized policies. To
describe the method, assume a single probabilistic constraint
$$\sum_{i\in I}\sum_{a\in A(i)} \alpha_{ia}\, f_{ia}(\pi) \le \beta$$
on the state-action frequencies. In the Lagrange-multiplier method, the constraint
is eliminated by putting it into the criterion function by means of a Lagrange
multiplier λ ≥ 0. That is, the goal function is changed from $\sum_{i,a} c_i(a)x_{ia}$ to
$\sum_{i,a} c_i(a)x_{ia} + \lambda\bigl(\sum_{i,a} \alpha_{ia}x_{ia} - \beta\bigr)$. The Lagrange multiplier may be interpreted as
a cost charged for each unit of some resource that is used. The original Markov decision
problem without probabilistic constraint is obtained by taking λ = 0. It is assumed
that the probabilistic constraint is not satisfied for the optimal stationary policy in
the unconstrained problem; otherwise, this policy is optimal for the constrained
problem as well. Thus, for a given value of the Lagrange multiplier λ > 0, we
consider the unconstrained Markov decision problem with one-step costs
$$c_i^{\lambda}(a) = c_i(a) + \lambda \alpha_{ia}$$
and one-step transition probabilities $p_{ij}(a)$ as before. Solving this unconstrained
Markov decision problem yields an optimal deterministic policy R(λ) that always
prescribes a fixed action $R_i(\lambda)$ whenever the system is in state i. Let β(λ) be
the constraint level associated with policy R(λ), that is,
$$\beta(\lambda) = \sum_{i\in I} \alpha_{i,R_i(\lambda)}\, f_{i,R_i(\lambda)}(R(\lambda)).$$
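As a concrete illustration of this step, the following Python sketch forms the modified one-step costs and extracts a policy R(λ) for a given λ. It is only a sketch under simplifying assumptions: the array layout of p, c and α, the function name lagrangian_policy and the use of relative value iteration for the unconstrained average-cost problem (which requires a unichain, aperiodic model) are choices made here for illustration and are not part of the text.

```python
import numpy as np

def lagrangian_policy(p, c, alpha, lam, n_iter=10_000, tol=1e-9):
    """Return a deterministic policy R(lambda) for the unconstrained MDP
    with modified one-step costs c_i^lambda(a) = c_i(a) + lambda*alpha_{ia}.

    Assumed layout (illustrative): p[i, a, j] = p_ij(a), c[i, a] = c_i(a),
    alpha[i, a] = alpha_{ia}; every action is taken to be admissible in
    every state.  Relative value iteration for the average-cost criterion
    is used, which converges under a unichain/aperiodicity assumption.
    """
    p, c, alpha = np.asarray(p), np.asarray(c), np.asarray(alpha)
    c_lam = c + lam * alpha            # Lagrangian one-step costs c_i^lambda(a)
    v = np.zeros(p.shape[0])           # relative values
    for _ in range(n_iter):
        q = c_lam + p @ v              # q[i, a] = c_i^lambda(a) + sum_j p_ij(a) v_j
        v_new = q.min(axis=1)
        v_new = v_new - v_new[0]       # renormalize to keep the iterates bounded
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return q.argmin(axis=1)            # R_i(lambda): minimizing action in each state i
```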
If β(λ) > β one should increase λ, otherwise one should decrease λ. Why? The
Lagrange multiplier λ should be adjusted until the smallest value of λ is found
for which β(λ) ≤ β. Bisection is a convenient method to adjust λ. How do we
calculate β(λ) for a given value of λ? To do so, observe that β(λ) can be interpreted
as the average cost in a single Markov chain with an appropriate cost structure.
Consider the Markov chain describing the state of the system under policy R(λ).
In this Markov process, the long-run average cost per time unit equals β(λ) when
it is assumed that a direct cost of $\alpha_{i,R_i(\lambda)}$ is incurred each time the process visits
state i. An effective method to compute the average cost β(λ) is to apply value
iteration to a single Markov chain; see Example 6.6.1 in the next section.
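Continuing the sketch given above (same array conventions and numpy import), the following hedged example evaluates β(λ) and carries out the bisection on λ. Computing β(λ) from the stationary distribution of the chain induced by R(λ) is equivalent to the average-cost interpretation just described; the names beta_of_lambda and smallest_feasible_lambda are illustrative only, and the doubling of the initial bracket assumes that the constraint can be met for some finite λ.

```python
def beta_of_lambda(p, alpha, policy):
    """beta(lambda): long-run average of the direct costs alpha_{i,R_i(lambda)}
    in the Markov chain induced by the deterministic policy.  Here the
    stationary distribution of that chain is computed directly; applying
    value iteration to this single Markov chain (cf. Example 6.6.1) would
    give the same number."""
    n = len(policy)
    P = np.array([p[i][policy[i]] for i in range(n)])       # rows p_ij(R_i(lambda))
    a = np.array([alpha[i][policy[i]] for i in range(n)])   # direct costs alpha_{i,R_i(lambda)}
    # solve pi P = pi together with sum_i pi_i = 1 (least squares for the stacked system)
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return float(pi @ a)

def smallest_feasible_lambda(p, c, alpha, beta, lam_hi=1.0, n_bisect=40):
    """Bisection on lambda: approximate the smallest lambda >= 0 with
    beta(lambda) <= beta, assuming the constraint is attainable for some
    finite lambda.  Returns that lambda and the policy R(lambda)."""
    while beta_of_lambda(p, alpha, lagrangian_policy(p, c, alpha, lam_hi)) > beta:
        lam_hi *= 2.0                                        # enlarge the bracket until feasible
    lam_lo = 0.0
    for _ in range(n_bisect):
        lam = 0.5 * (lam_lo + lam_hi)
        if beta_of_lambda(p, alpha, lagrangian_policy(p, c, alpha, lam)) > beta:
            lam_lo = lam                                     # beta(lambda) > beta: increase lambda
        else:
            lam_hi = lam                                     # constraint met: try a smaller lambda
    return lam_hi, lagrangian_policy(p, c, alpha, lam_hi)
```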
The average cost of the stationary policy obtained by the Lagrangian approach
will in general be larger than the average cost of the stationary randomized policy
resulting from the linear programming formulation. Also, it should be pointed out
that there is no guarantee that the policy obtained by the Lagrangian approach is
the best policy among all stationary policies satisfying the probabilistic constraint,
although in most practical situations this may be expected to be the case. In spite