
Step 2 (policy-improvement step). For each state i ∈ I, determine an action a_i yielding the minimum in

\[
\min_{a \in A(i)} \Big\{ c_i(a) - g(R) + \sum_{j \in I} p_{ij}(a)\, v_j(R) \Big\}.
\]

The new stationary policy R̄ is obtained by choosing R̄_i = a_i for all i ∈ I with the convention that R̄_i is chosen equal to the old action R_i when this action minimizes the policy-improvement quantity.
Step 3 (convergence test). If the new policy R̄ = R, then the algorithm is stopped with policy R. Otherwise, go to step 1 with R replaced by R̄.

  The policy-iteration algorithm converges after a finite number of iterations to an average cost optimal policy. We defer the proof to Section 6.7. Empirically, the policy-iteration algorithm is found to be remarkably robust and to converge very fast in specific problems. The number of iterations is practically independent of the number of states and typically varies between 3 and 15, say. Also, it can roughly be stated that the average costs of the policies generated by policy iteration converge at least exponentially fast to the minimum average cost, with the greatest improvements in the first few iterations.
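
  To make the three steps concrete, the following is a minimal sketch in Python of the full policy-iteration loop. It assumes a finite state space, a common action set in every state, and unichain policies (so the value-determination equations of step 1 have a unique solution once one relative value is normalized to zero). The names policy_iteration, costs and probs are illustrative and not from the text.

```python
import numpy as np

def policy_iteration(costs, probs, ref_state=0, max_iters=100):
    """Policy iteration for the long-run average cost criterion (sketch).

    costs[i][a]    : one-step cost c_i(a)
    probs[i][a][j] : transition probability p_ij(a)
    Every state is assumed to share the same action set {0, ..., A-1}.
    """
    n = len(costs)                       # number of states
    n_actions = len(costs[0])
    R = [0] * n                          # arbitrary initial stationary policy

    for _ in range(max_iters):
        # Step 1 (value determination): solve the linear equations
        #   v_i = c_i(R_i) - g + sum_j p_ij(R_i) v_j,  i in I,
        # together with the normalization v_{ref_state} = 0.
        A = np.zeros((n + 1, n + 1))
        b = np.zeros(n + 1)
        for i in range(n):
            for j in range(n):
                A[i, j] = (1.0 if i == j else 0.0) - probs[i][R[i]][j]
            A[i, n] = 1.0                # coefficient of g
            b[i] = costs[i][R[i]]
        A[n, ref_state] = 1.0            # normalization v_{ref_state} = 0
        sol = np.linalg.solve(A, b)
        v, g = sol[:n], sol[n]

        # Step 2 (policy improvement): in each state pick an action attaining
        #   min_a { c_i(a) - g + sum_j p_ij(a) v_j },
        # keeping the old action whenever it already attains the minimum.
        R_new = list(R)
        for i in range(n):
            test = [costs[i][a] - g + sum(probs[i][a][j] * v[j] for j in range(n))
                    for a in range(n_actions)]
            if test[R[i]] > min(test) + 1e-10:
                R_new[i] = int(np.argmin(test))

        # Step 3 (convergence test): stop when the policy did not change.
        if R_new == R:
            return R, g, v
        R = R_new

    return R, g, v
```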

Remark 6.4.1 The average cost optimality equation
Since the policy-iteration algorithm converges after finitely many iterations, there exist numbers g* and v_i*, i ∈ I, such that

\[
v_i^* = \min_{a \in A(i)} \Big\{ c_i(a) - g^* + \sum_{j \in I} p_{ij}(a)\, v_j^* \Big\}, \qquad i \in I. \tag{6.4.3}
\]
This functional equation is called the average cost optimality equation. Using Theorem 6.2.1, we can directly verify that any stationary policy R* for which the action R_i* minimizes the right-hand side of (6.4.3) for all i ∈ I is average cost optimal. To see this, note that

\[
v_i^* = c_i(R_i^*) - g^* + \sum_{j \in I} p_{ij}(R_i^*)\, v_j^*, \qquad i \in I \tag{6.4.4}
\]
                and

\[
v_i^* \le c_i(a) - g^* + \sum_{j \in I} p_{ij}(a)\, v_j^*, \qquad a \in A(i) \text{ and } i \in I. \tag{6.4.5}
\]
The equality (6.4.4) and Theorem 6.2.1 imply that g(R*) = g*. Let R be any stationary policy. Taking a = R_i in (6.4.5) for all i ∈ I and applying Theorem 6.2.1, we find g(R) ≥ g*. In other words, g(R*) ≤ g(R) for any stationary policy R. This shows not only that policy R* is average cost optimal but also that the constant g* in (6.4.3) is uniquely determined as the minimal average cost per time unit.
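
  As an illustration of the remark, the following check applies the policy_iteration sketch given earlier to a hypothetical two-state, two-action example (the numbers are made up purely for illustration): after convergence, g* and the relative values satisfy the optimality equation (6.4.3), and no stationary policy attains a smaller average cost than g*.

```python
import itertools
import numpy as np

# Hypothetical two-state, two-action example (numbers made up for illustration).
costs = [[1.0, 4.0],                  # c_0(0), c_0(1)
         [3.0, 0.5]]                  # c_1(0), c_1(1)
probs = [[[0.8, 0.2], [0.3, 0.7]],    # p_0j(0), p_0j(1)
         [[0.5, 0.5], [0.1, 0.9]]]    # p_1j(0), p_1j(1)

R_star, g_star, v_star = policy_iteration(costs, probs)

# The fixed point satisfies the optimality equation (6.4.3).
for i in range(2):
    rhs = min(costs[i][a] - g_star + sum(probs[i][a][j] * v_star[j] for j in range(2))
              for a in range(2))
    assert abs(v_star[i] - rhs) < 1e-8

# g* is minimal: no stationary policy R has a smaller average cost g(R).
for R in itertools.product(range(2), repeat=2):
    P = np.array([probs[i][R[i]] for i in range(2)])
    # stationary distribution of the Markov chain induced by R: pi P = pi, sum(pi) = 1
    M = np.vstack([P.T - np.eye(2), np.ones((1, 2))])
    pi = np.linalg.lstsq(M, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
    g_R = sum(pi[i] * costs[i][R[i]] for i in range(2))
    assert g_R >= g_star - 1e-8

print("g* =", g_star, "optimal actions =", R_star)
```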