
This relation and the definition of (i, a, R) suggest we should look for an action
a in state i so that the quantity

\[
c_i(a) - g(R) + \sum_{j \in I} p_{ij}(a)\, v_j(R) \tag{6.2.10}
\]
                is as small as possible. The quantity in (6.2.10) is called the policy-improvement
                quantity. The above heuristic discussion suggests a main theorem that will be the
                basis for the algorithms to be discussed later. A direct proof of this theorem can
                be given without using any of the heuristic assumptions made above.
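
The minimization of the quantity (6.2.10) over the actions in each state is precisely the policy-improvement step on which the later algorithms build. As a minimal illustration, the Python sketch below computes the policy-improvement quantity and a minimizing action for a small hypothetical two-state, two-action model; the cost and transition data, the placeholder values for g(R) and v_j(R), and the name improvement_step are illustrative assumptions, not data from the text.

```python
import numpy as np

# Hypothetical two-state, two-action data: c[i, a] is the one-step cost c_i(a)
# and P[a][i, j] is the transition probability p_ij(a); all numbers are made up.
c = np.array([[1.0, 3.0],
              [2.0, 0.5]])
P = {0: np.array([[0.7, 0.3],
                  [0.4, 0.6]]),
     1: np.array([[0.2, 0.8],
                  [0.9, 0.1]])}

def improvement_step(g_R, v_R):
    """For each state i, return an action minimizing the policy-improvement
    quantity (6.2.10): c_i(a) - g(R) + sum_j p_ij(a) v_j(R)."""
    n_states, n_actions = c.shape
    return [int(np.argmin([c[i, a] - g_R + P[a][i] @ v_R
                           for a in range(n_actions)]))
            for i in range(n_states)]

# g(R) and the relative values v_j(R) would come from the value-determination
# step for the current policy R; the numbers below are placeholders.
print(improvement_step(g_R=1.2, v_R=np.array([0.0, -0.8])))
```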

Theorem 6.2.1 (improvement theorem)  Let g and v_i, i ∈ I, be given numbers.
Suppose that the stationary policy R has the property

\[
c_i(R_i) - g + \sum_{j \in I} p_{ij}(R_i)\, v_j \le v_i \quad \text{for each } i \in I. \tag{6.2.11}
\]

Then the long-run average cost of policy R satisfies

\[
g_i(R) \le g, \qquad i \in I, \tag{6.2.12}
\]

where the strict inequality sign holds in (6.2.12) for i = r when state r is recurrent
under policy R and the strict inequality sign holds in (6.2.11) for i = r. The result
is also true when the inequality signs in (6.2.11) and (6.2.12) are reversed.
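
Before turning to the proof, a small numerical check may help fix ideas. The sketch below reuses the hypothetical two-state data from the previous snippet (repeated so it is self-contained), verifies condition (6.2.11) for given numbers g and v_i, and then computes the long-run average cost of R from the equilibrium distribution of the unichain matrix P_R, which the theorem predicts to be at most g; all concrete numbers and helper names are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical two-state, two-action data; the policy R chooses action R[i] in state i.
c = np.array([[1.0, 3.0],
              [2.0, 0.5]])
P = {0: np.array([[0.7, 0.3],
                  [0.4, 0.6]]),
     1: np.array([[0.2, 0.8],
                  [0.9, 0.1]])}
R = [0, 1]
P_R = np.array([P[R[i]][i] for i in range(2)])   # transition matrix under R
c_R = np.array([c[i, R[i]] for i in range(2)])   # cost vector under R

def satisfies_6_2_11(g, v):
    """Check c_i(R_i) - g + sum_j p_ij(R_i) v_j <= v_i for every state i."""
    return np.all(c_R - g + P_R @ v <= v + 1e-12)

def average_cost_unichain():
    """Long-run average cost of R: g(R) = sum_i pi_i(R) c_i(R_i), with pi(R)
    the unique equilibrium distribution of the unichain matrix P_R."""
    n = P_R.shape[0]
    A = np.vstack([P_R.T - np.eye(n), np.ones(n)])   # pi P_R = pi, sum_i pi_i = 1
    b = np.append(np.zeros(n), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return pi @ c_R

g, v = 1.5, np.array([0.0, -1.0])        # candidate numbers for the theorem
if satisfies_6_2_11(g, v):
    print(average_cost_unichain(), "<=", g)   # theorem predicts g(R) <= g
```
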
Proof  We first give an intuitive explanation of the theorem and then a formal
proof. Suppose that a control cost of c_i(a) − g is incurred each time the
action a is chosen in state i, while a terminal cost of v_j is incurred when the
control of the system is stopped and the system is left behind in state j. Then
(6.2.11) states that controlling the system for one step according to rule R and
stopping next is preferable to stopping directly when the initial state is i. Since this
property holds for each initial state, repeated application of it shows that controlling
the system for m steps according to rule R and stopping after that is also preferable
to stopping directly. Thus, for each initial state i ∈ I,

\[
V_m(i, R) - mg + \sum_{j \in I} p_{ij}^{(m)}(R)\, v_j \le v_i, \qquad m = 1, 2, \ldots.
\]
Dividing both sides of this inequality by m and letting m → ∞, we get (6.2.12),
since V_m(i, R)/m → g_i(R) and the bounded terms involving the v_j vanish after
division by m.
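
As a quick numerical sanity check of the m-step inequality above, the following sketch iterates the recursion V_m = c_R + P_R V_{m-1} together with the m-step transition probabilities for the same hypothetical example and confirms the inequality for the first few values of m; the numbers are again illustrative assumptions.

```python
import numpy as np

# Same illustrative data: policy R has cost vector c_R and transition matrix P_R,
# and the made-up numbers (g, v) satisfy (6.2.11) for this example.
P_R = np.array([[0.7, 0.3],
                [0.9, 0.1]])
c_R = np.array([1.0, 0.5])
g, v = 1.5, np.array([0.0, -1.0])

V = np.zeros(2)      # V_0(i, R) = 0
Pm = np.eye(2)       # 0-step transition probabilities p_ij^(0)(R)
for m in range(1, 11):
    V = c_R + P_R @ V        # V_m(i, R) = c_i(R_i) + sum_j p_ij(R_i) V_{m-1}(j, R)
    Pm = Pm @ P_R            # m-step transition probabilities p_ij^(m)(R)
    assert np.all(V - m * g + Pm @ v <= v + 1e-12)
print("m-step inequality verified for m = 1, ..., 10")
```
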
Next we give a formal proof that yields the result with the strict inequality sign
as well. The proof is first given under the assumption that the Markov chain {X_n}
associated with policy R is unichain. Then this Markov chain has a unique
equilibrium distribution {π_j(R), j ∈ I}, where π_j(R) > 0 only if state j is recurrent
under policy R. Multiply both sides of (6.2.11) by π_i(R) and sum over i. This
gives

\[
\sum_{i \in I} \pi_i(R)\, c_i(R_i) - g + \sum_{i \in I} \pi_i(R) \sum_{j \in I} p_{ij}(R_i)\, v_j \le \sum_{i \in I} \pi_i(R)\, v_i,
\]