6.4 POLICY-ITERATION ALGORITHM
For ease of presentation we will discuss the policy-iteration algorithm under a
unichain assumption that is satisfied in most applications.
Unichain assumption For each stationary policy the associated Markov chain
$\{X_n\}$ has no two disjoint closed sets.
The relative values associated with a given policy R provide a tool for constructing a new policy $\overline{R}$ whose average cost is no more than that of the current policy R. In order to improve a given policy R whose average cost $g(R)$ and relative values $v_i(R)$, $i \in I$, have been computed, we apply Theorem 6.2.1 with $g = g(R)$ and $v_i = v_i(R)$, $i \in I$. By constructing a new policy $\overline{R}$ such that, for each state $i \in I$,

$$c_i(\overline{R}_i) - g(R) + \sum_{j \in I} p_{ij}(\overline{R}_i)\, v_j(R) \le v_i(R), \qquad (6.4.1)$$

we obtain an improved rule $\overline{R}$ satisfying $g(\overline{R}) \le g(R)$.
In constructing such an improved policy $\overline{R}$ it is important to realize that for each state $i$ separately an action $\overline{R}_i$ satisfying (6.4.1) can be determined. As a side remark, we point out that this flexibility of the policy-improvement procedure may be exploited in specific applications to generate a sequence of improved policies within a subclass of policies having a simple structure. A particular way to find for state $i \in I$ an action $\overline{R}_i$ satisfying (6.4.1) is to minimize

$$c_i(a) - g(R) + \sum_{j \in I} p_{ij}(a)\, v_j(R) \qquad (6.4.2)$$

with respect to $a \in A(i)$. Noting that the expression in (6.4.2) equals $v_i(R)$ for $a = R_i$, it follows that (6.4.1) is satisfied for the action $\overline{R}_i$ that minimizes (6.4.2) with respect to $a \in A(i)$.
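To make the improvement step concrete, the minimization of (6.4.2) can be sketched in a few lines of Python once $g(R)$ and the relative values $v_j(R)$ are available. The data layout below (a cost table c[i][a] and a transition array P[i][a][j]) and the function name are illustrative assumptions, not notation from the text.

```python
import numpy as np

def improve_policy(c, P, g, v):
    """Policy-improvement step: for each state i, choose an action a
    minimizing the test quantity (6.4.2):
        c[i][a] - g + sum_j P[i][a][j] * v[j].

    c : c[i][a], one-step cost of action a in state i
    P : P[i][a][j], transition probability from i to j under action a
    g, v : average cost g(R) and relative values v_i(R) of the current rule
    Returns the improved rule as a list of chosen actions, one per state.
    """
    R_bar = []
    for i in range(len(c)):
        test = [c[i][a] - g + np.dot(P[i][a], v) for a in range(len(c[i]))]
        R_bar.append(int(np.argmin(test)))
    return R_bar
```

Since the test quantity equals $v_i(R)$ at $a = R_i$, the rule returned in this way automatically satisfies (6.4.1). We are now in a position to formulate the following algorithm.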
Policy-iteration algorithm
Step 0 (initialization). Choose a stationary policy R.
Step 1 (value-determination step). For the current rule R, compute the unique solution $\{g(R), v_i(R)\}$ to the following system of linear equations:

$$v_i = c_i(R_i) - g + \sum_{j \in I} p_{ij}(R_i)\, v_j, \qquad i \in I,$$
$$v_s = 0,$$

where $s$ is an arbitrarily chosen state.
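Numerically, Step 1 amounts to solving a nonsingular linear system of $|I| + 1$ equations in the unknowns $g$ and $v_i$, $i \in I$, with $v_s = 0$ serving as the normalizing equation. The following Python sketch sets up this system directly; the array layout and all names are illustrative assumptions rather than part of the algorithm's statement.

```python
import numpy as np

def value_determination(P, c, s=0):
    """Value-determination step: solve v_i = c_i - g + sum_j P[i, j] * v_j
    for all i, together with the normalization v_s = 0.

    P : (N, N) transition matrix of the Markov chain under the current rule R
    c : (N,) one-step costs c_i(R_i)
    Returns g(R) and the vector of relative values v_i(R).
    """
    N = len(c)
    # Unknown vector x = (v_0, ..., v_{N-1}, g).
    A = np.zeros((N + 1, N + 1))
    A[:N, :N] = np.eye(N) - P   # rows i: v_i - sum_j P[i, j] v_j + g = c_i
    A[:N, N] = 1.0              # coefficient of g in each value equation
    A[N, s] = 1.0               # normalization row: v_s = 0
    b = np.append(np.asarray(c, dtype=float), 0.0)
    x = np.linalg.solve(A, b)   # unique solution under the unichain assumption
    return x[N], x[:N]
```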