Step 2 (policy-improvement step). For each state $i \in I$, determine an action $a_i$ yielding the minimum in
$$\min_{a \in A(i)} \left\{ c_i(a) - g(R) + \sum_{j \in I} p_{ij}(a)\,\upsilon_j(R) \right\}.$$
The new stationary policy $\overline{R}$ is obtained by choosing $\overline{R}_i = a_i$ for all $i \in I$, with the convention that $\overline{R}_i$ is chosen equal to the old action $R_i$ when this action minimizes the policy-improvement quantity.

Step 3 (convergence test). If the new policy $\overline{R} = R$, then the algorithm is stopped with policy $R$. Otherwise, go to step 1 with $R$ replaced by $\overline{R}$.
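The three steps translate directly into code. The sketch below is a minimal illustration in Python, not the book's own implementation: the data layout (nested lists `c`, `p` and `actions`), the choice of state 0 as reference state for the normalization $\upsilon_0 = 0$, and the tie-breaking tolerance `1e-12` are assumptions made only for this example; it is also taken for granted that every stationary policy induces a single recurrent class, so that the value-determination equations of step 1 have a unique solution.

```python
import numpy as np

def value_determination(c, p, R):
    """Step 1: solve v_i = c_i(R_i) - g + sum_j p_ij(R_i) v_j, i in I,
    together with the normalization v_0 = 0, for the average cost g(R)
    and the relative values v_i(R)."""
    n = len(R)
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    for i in range(n):
        A[i, :n] = -np.asarray(p[i][R[i]])
        A[i, i] += 1.0
        A[i, n] = 1.0                      # coefficient of the unknown g
        b[i] = c[i][R[i]]
    A[n, 0] = 1.0                          # normalization v_0 = 0
    sol = np.linalg.solve(A, b)
    return sol[n], sol[:n]                 # g(R), (v_i(R))

def policy_iteration(c, p, actions):
    """Sketch of the policy-iteration algorithm for the average cost criterion.
    c[i][a] is the one-step cost, p[i][a] the row (p_ij(a), j in I),
    actions[i] the finite action set A(i); states are 0, ..., n-1."""
    n = len(actions)
    R = [actions[i][0] for i in range(n)]            # arbitrary starting policy
    while True:
        g, v = value_determination(c, p, R)          # step 1
        # Step 2 (policy improvement): minimize the test quantity in each state,
        # keeping the old action whenever it also attains the minimum.
        R_new = []
        for i in range(n):
            test = {a: c[i][a] - g + np.dot(p[i][a], v) for a in actions[i]}
            best = min(test, key=test.get)
            R_new.append(R[i] if test[R[i]] <= test[best] + 1e-12 else best)
        # Step 3 (convergence test): stop when the policy no longer changes.
        if R_new == R:
            return R, g, v
        R = R_new
```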
The policy-iteration algorithm converges after a finite number of iterations to an
average cost optimal policy. We defer the proof to Section 6.7. The policy-iteration
algorithm is empirically found to be remarkably robust and to converge
very fast in specific problems. The number of iterations is practically independent
of the number of states and typically varies between 3 and 15, say. Also, it can be
roughly stated that the average costs of the policies generated by policy iteration
converge at least exponentially fast to the minimum average cost, with the greatest
improvements in the first few iterations.
Remark 6.4.1 The average cost optimality equation
Since the policy-iteration algorithm converges after finitely many iterations, there
exist numbers $g^*$ and $\upsilon_i^*$, $i \in I$, such that
$$\upsilon_i^* = \min_{a \in A(i)} \left\{ c_i(a) - g^* + \sum_{j \in I} p_{ij}(a)\,\upsilon_j^* \right\}, \quad i \in I. \tag{6.4.3}$$
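As a quick numerical illustration, one can verify that the pair $(g, \upsilon)$ returned by the `policy_iteration` sketch above indeed satisfies (6.4.3) by recomputing the minimum on the right-hand side in every state; the data layout and the tolerance `tol` are the same assumptions as before.

```python
def satisfies_optimality_equation(c, p, actions, g, v, tol=1e-9):
    """Check that (g, v) satisfies the average cost optimality equation
    (6.4.3) in every state, up to a small numerical tolerance."""
    return all(
        abs(v[i] - min(c[i][a] - g + np.dot(p[i][a], v) for a in actions[i])) <= tol
        for i in range(len(actions))
    )
```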
This functional equation is called the average cost optimality equation. Using Theo-
rem 6.2.1, we can directly verify that any stationary policy $R^*$ for which the action
$R_i^*$ minimizes the right-hand side of (6.4.3) for all $i \in I$ is average cost optimal.
To see this, note that
$$\upsilon_i^* = c_i(R_i^*) - g^* + \sum_{j \in I} p_{ij}(R_i^*)\,\upsilon_j^*, \quad i \in I \tag{6.4.4}$$
and
$$\upsilon_i^* \le c_i(a) - g^* + \sum_{j \in I} p_{ij}(a)\,\upsilon_j^*, \quad a \in A(i) \text{ and } i \in I. \tag{6.4.5}$$
The equality (6.4.4) and Theorem 6.2.1 imply that $g(R^*) = g^*$. Let $R$ be any sta-
tionary policy. Taking $a = R_i$ in (6.4.5) for all $i \in I$ and applying Theorem 6.2.1,
we find $g(R) \ge g^*$. In other words, $g(R^*) \le g(R)$ for any stationary policy $R$.
This shows not only that policy $R^*$ is average cost optimal but also that the
constant $g^*$ in (6.4.3) is uniquely determined as the minimal average cost per time
unit.
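For a toy problem the conclusion $g(R^*) \le g(R)$ for every stationary policy $R$ can even be checked by brute force, reusing the hypothetical `value_determination` helper from the earlier sketch; enumerating all stationary policies is of course only feasible when the numbers of states and actions are very small.

```python
from itertools import product

def g_star_is_minimal(c, p, actions, g_star, tol=1e-9):
    """Evaluate g(R) for every stationary policy R and check that none
    falls below the value g* produced by policy iteration."""
    return all(
        value_determination(c, p, list(choice))[0] >= g_star - tol
        for choice in product(*actions)
    )
```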

