
                            6.4   POLICY-ITERATION ALGORITHM

For ease of presentation we will discuss the policy-iteration algorithm under a unichain assumption that is satisfied in most applications.

Unichain assumption  For each stationary policy the associated Markov chain {X_n} has no two disjoint closed sets.

The relative values associated with a given policy R provide a tool for constructing a new policy R̄ whose average cost is no more than that of the current policy R. In order to improve a given policy R whose average cost g(R) and relative values v_i(R), i ∈ I, have been computed, we apply Theorem 6.2.1 with g = g(R) and v_i = v_i(R), i ∈ I. By constructing a new policy R̄ such that, for each state i ∈ I,

$$c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, v_j(R) \le v_i(R), \tag{6.4.1}$$

we obtain an improved rule R̄ in the sense that g(R̄) ≤ g(R). In constructing such an improved policy R̄ it is important to realize that an action R̄_i satisfying (6.4.1) can be determined for each state i separately. As a side remark, we point out that this flexibility of the policy-improvement procedure may be exploited in specific applications to generate a sequence of improved policies within a subclass of policies having a simple structure. A particular way to find for state i ∈ I an action R̄_i satisfying (6.4.1) is to minimize


$$c_i(a) - g(R) + \sum_{j \in I} p_{ij}(a)\, v_j(R) \tag{6.4.2}$$
with respect to a ∈ A(i). Noting that the expression in (6.4.2) equals v_i(R) for a = R_i, it follows that (6.4.1) is satisfied by any action R̄_i that minimizes (6.4.2) over a ∈ A(i). We are now in a position to formulate the following algorithm.
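Before stating it, here is a minimal sketch of this improvement step in Python. The data layout is an assumption made for illustration: c[i][a] holds the one-step cost c_i(a), p[i][a] holds the vector of transition probabilities (p_ij(a))_j, and the name improve_policy is hypothetical rather than from the text.

```python
import numpy as np

def improve_policy(R, g, v, c, p, actions):
    """Policy-improvement step: in each state i, pick an action
    minimizing the test quantity (6.4.2)."""
    R_bar = list(R)
    for i, A_i in enumerate(actions):
        # c_i(a) - g(R) + sum_j p_ij(a) v_j(R), one value per action a in A(i)
        test = [c[i][a] - g + np.dot(p[i][a], v) for a in A_i]
        R_bar[i] = A_i[int(np.argmin(test))]
    return R_bar
```

Since the test quantity equals v_i(R) at a = R_i, the minimizing action always satisfies (6.4.1), so the returned policy can never be worse than R.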


Policy-iteration algorithm
Step 0 (initialization). Choose a stationary policy R.
Step 1 (value-determination step). For the current rule R, compute the unique solution {g(R), v_i(R)} to the following system of linear equations:

$$v_i = c_i(R_i) - g + \sum_{j \in I} p_{ij}(R_i)\, v_j, \qquad i \in I,$$
$$v_s = 0,$$

where s is an arbitrarily chosen state.
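For illustration, this value-determination step can be carried out by solving the linear system directly. Below is a minimal sketch in Python under the same assumed data layout as above; the function name value_determination is hypothetical. Under the unichain assumption the (N+1) × (N+1) system is nonsingular, so a direct solve applies.

```python
import numpy as np

def value_determination(R, c, p, s=0):
    """Value-determination step: solve the equations of Step 1 for
    g(R) and the relative values v_i(R), with the normalization v_s = 0."""
    N = len(R)
    c_R = np.array([c[i][R[i]] for i in range(N)])  # costs c_i(R_i)
    P_R = np.array([p[i][R[i]] for i in range(N)])  # matrix of p_ij(R_i)
    # Unknowns x = (v_0, ..., v_{N-1}, g); the balance equations read
    #   v_i - sum_j p_ij(R_i) v_j + g = c_i(R_i),  i in I,
    # and the last row enforces v_s = 0.
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = np.eye(N) - P_R
    A[:N, N] = 1.0       # coefficient of g in each balance equation
    A[N, s] = 1.0        # normalization row: v_s = 0
    b = np.append(c_R, 0.0)
    x = np.linalg.solve(A, b)
    return x[N], x[:N]   # g(R) and the vector (v_i(R))
```

Alternating this step with the improvement step sketched above, until the policy no longer changes, yields the policy-iteration algorithm.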