
This relation and the definition of (i, a, R) suggest we should look for an action
a in state i so that the quantity

\[
c_i(a) - g(R) + \sum_{j \in I} p_{ij}(a)\, v_j(R) \tag{6.2.10}
\]
                is as small as possible. The quantity in (6.2.10) is called the policy-improvement
                quantity. The above heuristic discussion suggests a main theorem that will be the
                basis for the algorithms to be discussed later. A direct proof of this theorem can
                be given without using any of the heuristic assumptions made above.
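
The minimization of the quantity (6.2.10) over the actions in each state is precisely the policy-improvement step on which the later algorithms build. As a minimal illustration, the Python sketch below computes the policy-improvement quantity and a minimizing action for a small hypothetical two-state, two-action model; the cost and transition data, the placeholder values for g(R) and v_j(R), and the name improvement_step are illustrative assumptions, not data from the text.

```python
import numpy as np

# Hypothetical two-state, two-action data: c[i, a] is the one-step cost c_i(a)
# and P[a][i, j] is the transition probability p_ij(a); all numbers are made up.
c = np.array([[1.0, 3.0],
              [2.0, 0.5]])
P = {0: np.array([[0.7, 0.3],
                  [0.4, 0.6]]),
     1: np.array([[0.2, 0.8],
                  [0.9, 0.1]])}

def improvement_step(g_R, v_R):
    """For each state i, return an action minimizing the policy-improvement
    quantity (6.2.10): c_i(a) - g(R) + sum_j p_ij(a) v_j(R)."""
    n_states, n_actions = c.shape
    return [int(np.argmin([c[i, a] - g_R + P[a][i] @ v_R
                           for a in range(n_actions)]))
            for i in range(n_states)]

# g(R) and the relative values v_j(R) would come from the value-determination
# step for the current policy R; the numbers below are placeholders.
print(improvement_step(g_R=1.2, v_R=np.array([0.0, -0.8])))
```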

Theorem 6.2.1 (improvement theorem)  Let g and v_i, i ∈ I, be given numbers.
Suppose that the stationary policy R has the property

\[
c_i(R_i) - g + \sum_{j \in I} p_{ij}(R_i)\, v_j \le v_i \quad \text{for each } i \in I. \tag{6.2.11}
\]

Then the long-run average cost of policy R satisfies

\[
g_i(R) \le g, \qquad i \in I, \tag{6.2.12}
\]

where the strict inequality sign holds in (6.2.12) for i = r when state r is recurrent
under policy R and the strict inequality sign holds in (6.2.11) for i = r. The result
is also true when the inequality signs in (6.2.11) and (6.2.12) are reversed.
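
Before turning to the proof, a small numerical check may help fix ideas. The sketch below reuses the hypothetical two-state data from the previous snippet (repeated so it is self-contained), verifies condition (6.2.11) for given numbers g and v_i, and then computes the long-run average cost of R from the equilibrium distribution of the unichain matrix P_R, which the theorem predicts to be at most g; all concrete numbers and helper names are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical two-state, two-action data; the policy R chooses action R[i] in state i.
c = np.array([[1.0, 3.0],
              [2.0, 0.5]])
P = {0: np.array([[0.7, 0.3],
                  [0.4, 0.6]]),
     1: np.array([[0.2, 0.8],
                  [0.9, 0.1]])}
R = [0, 1]
P_R = np.array([P[R[i]][i] for i in range(2)])   # transition matrix under R
c_R = np.array([c[i, R[i]] for i in range(2)])   # cost vector under R

def satisfies_6_2_11(g, v):
    """Check c_i(R_i) - g + sum_j p_ij(R_i) v_j <= v_i for every state i."""
    return np.all(c_R - g + P_R @ v <= v + 1e-12)

def average_cost_unichain():
    """Long-run average cost of R: g(R) = sum_i pi_i(R) c_i(R_i), with pi(R)
    the unique equilibrium distribution of the unichain matrix P_R."""
    n = P_R.shape[0]
    A = np.vstack([P_R.T - np.eye(n), np.ones(n)])   # pi P_R = pi, sum_i pi_i = 1
    b = np.append(np.zeros(n), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return pi @ c_R

g, v = 1.5, np.array([0.0, -1.0])        # candidate numbers for the theorem
if satisfies_6_2_11(g, v):
    print(average_cost_unichain(), "<=", g)   # theorem predicts g(R) <= g
```
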
Proof  We first give an intuitive explanation of the theorem and then a formal
proof. Suppose that a control cost of c_i(a) − g is incurred each time the
action a is chosen in state i, while a terminal cost of v_j is incurred when the
control of the system is stopped and the system is left behind in state j. Then
(6.2.11) states that controlling the system for one step according to rule R and
stopping next is preferable to stopping directly when the initial state is i. Since this
property holds for each initial state, repeated application of it shows that controlling
the system for m steps according to rule R and stopping after that is also preferable
to stopping directly. Thus, for each initial state i ∈ I,

\[
V_m(i, R) - mg + \sum_{j \in I} p_{ij}^{(m)}(R)\, v_j \le v_i, \qquad m = 1, 2, \ldots.
\]
Dividing both sides of this inequality by m and letting m → ∞, we get (6.2.12),
since V_m(i, R)/m → g_i(R) and the bounded terms involving the v_j vanish after
division by m.
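
As a quick numerical sanity check of the m-step inequality above, the following sketch iterates the recursion V_m = c_R + P_R V_{m-1} together with the m-step transition probabilities for the same hypothetical example and confirms the inequality for the first few values of m; the numbers are again illustrative assumptions.

```python
import numpy as np

# Same illustrative data: policy R has cost vector c_R and transition matrix P_R,
# and the made-up numbers (g, v) satisfy (6.2.11) for this example.
P_R = np.array([[0.7, 0.3],
                [0.9, 0.1]])
c_R = np.array([1.0, 0.5])
g, v = 1.5, np.array([0.0, -1.0])

V = np.zeros(2)      # V_0(i, R) = 0
Pm = np.eye(2)       # 0-step transition probabilities p_ij^(0)(R)
for m in range(1, 11):
    V = c_R + P_R @ V        # V_m(i, R) = c_i(R_i) + sum_j p_ij(R_i) V_{m-1}(j, R)
    Pm = Pm @ P_R            # m-step transition probabilities p_ij^(m)(R)
    assert np.all(V - m * g + Pm @ v <= v + 1e-12)
print("m-step inequality verified for m = 1, ..., 10")
```
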
Next we give a formal proof that yields the result with the strict inequality sign
as well. The proof is first given under the assumption that the Markov chain {X_n}
associated with policy R is unichain. Then this Markov chain has a unique
equilibrium distribution {π_j(R), j ∈ I}, where π_j(R) > 0 only if state j is recurrent
under policy R. Multiply both sides of (6.2.11) by π_i(R) and sum over i. This
gives

\[
\sum_{i \in I} \pi_i(R)\, c_i(R_i) - g + \sum_{i \in I} \pi_i(R) \sum_{j \in I} p_{ij}(R_i)\, v_j \le \sum_{i \in I} \pi_i(R)\, v_i,
\]