
where the strict inequality sign holds when the strict inequality sign holds in (6.2.11) for some i with $\pi_i(\overline{R}) > 0$. Interchanging the order of summation in the above inequality and using (6.2.6) and (6.2.7) with R replaced by $\overline{R}$, we find

$$ g(\overline{R}) - g + \sum_{j \in I} \pi_j(\overline{R})\,\upsilon_j \;\le\; \sum_{i \in I} \pi_i(\overline{R})\,\upsilon_i, $$

where the strict inequality sign holds when the strict inequality sign holds in (6.2.11) for some i with $\pi_i(\overline{R}) > 0$. This verifies the theorem for the case of a unichain policy $\overline{R}$. Next it is easy to establish the theorem for the case of a multichain policy $\overline{R}$. Letting $I_1(\overline{R}), \ldots, I_f(\overline{R})$ denote the recurrent subclasses of the Markov chain associated with policy $\overline{R}$, the above proof shows that for any $\ell = 1, \ldots, f$ the inequality (6.2.12) holds for all $i \in I_\ell(\overline{R})$. The proof of the theorem is next completed by noting that for each transient state i the average expected cost $g_i(\overline{R})$ is a linear combination of the average costs on the recurrent subclasses.
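To make the interchange step explicit, the following sketch assumes that (6.2.11) reads $c_i(\overline{R}_i) - g + \sum_{j \in I} p_{ij}(\overline{R}_i)\,\upsilon_j \le \upsilon_i$, that (6.2.6) states $g(\overline{R}) = \sum_{i \in I} \pi_i(\overline{R})\,c_i(\overline{R}_i)$ and that (6.2.7) gives the equilibrium equations $\pi_j(\overline{R}) = \sum_{i \in I} \pi_i(\overline{R})\,p_{ij}(\overline{R}_i)$; the exact statements appear earlier in Section 6.2. Multiplying (6.2.11) by $\pi_i(\overline{R})$, summing over i and using $\sum_{i \in I} \pi_i(\overline{R}) = 1$ gives

$$ \sum_{i \in I} \pi_i(\overline{R})\,c_i(\overline{R}_i) - g + \sum_{j \in I} \Bigl( \sum_{i \in I} \pi_i(\overline{R})\,p_{ij}(\overline{R}_i) \Bigr) \upsilon_j \;\le\; \sum_{i \in I} \pi_i(\overline{R})\,\upsilon_i, $$

and the assumed forms of (6.2.6) and (6.2.7) reduce the left-hand side to $g(\overline{R}) - g + \sum_{j \in I} \pi_j(\overline{R})\,\upsilon_j$.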


                            6.3  THE RELATIVE VALUE FUNCTION
                In Section 6.2 we introduced in a heuristic way the relative values for a given
                stationary policy R. In this section we give a rigorous treatment. This will be done
                for the case of a unichain policy R. Let r be any recurrent state of policy R. In
view of the unichain assumption, the Markov chain $\{X_n\}$ associated with policy R
                will visit state r after finitely many transitions, regardless of the initial state. Thus
                we can define, for each state i ∈ I,
       $T_i(R)$ = the expected time until the first return to state r when
               starting in state i and using policy R.

                In particular, letting a cycle be the time elapsed between two consecutive visits to
the regeneration state r under policy R, we have that $T_r(R)$ is the expected length
                of a cycle. Also define, for each i ∈ I,
      $K_i(R)$ = the expected costs incurred until the first return to state r
              when starting in state i and using policy R.

We use the convention that $K_i(R)$ includes the cost incurred when starting in state i but excludes the cost incurred when returning to state r. By the theory of renewal-reward processes, the average cost per time unit equals the expected costs incurred in one cycle divided by the expected length of one cycle, and so

$$ g(R) = \frac{K_r(R)}{T_r(R)}. $$
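For a finite state space, $T_i(R)$ and $K_i(R)$ satisfy the first-passage equations $T_i(R) = 1 + \sum_{j \neq r} p_{ij}(R_i)\,T_j(R)$ and $K_i(R) = c_i(R_i) + \sum_{j \neq r} p_{ij}(R_i)\,K_j(R)$, so both can be obtained by solving two linear systems with the same coefficient matrix. The sketch below illustrates this numerically; the function name and the three-state data are illustrative, not from the text, and P and c are assumed to hold the one-step transition probabilities and costs under the fixed policy R.

import numpy as np

def first_passage_quantities(P, c, r):
    """Solve the first-passage equations for a fixed unichain policy.

    P[i, j] : one-step transition probability p_ij(R_i) under policy R
    c[i]    : one-step cost c_i(R_i) under policy R
    r       : the chosen regeneration state

    Returns (T, K, g) with T[i] = T_i(R), K[i] = K_i(R) and
    g = K[r] / T[r], the average cost per time unit.
    """
    n = P.shape[0]
    # Cut off transitions into state r, so that (I - Q) x = b encodes
    # the first-passage equations x_i = b_i + sum over j != r of p_ij x_j.
    Q = P.copy()
    Q[:, r] = 0.0
    A = np.eye(n) - Q
    T = np.linalg.solve(A, np.ones(n))  # expected time until first return to r
    K = np.linalg.solve(A, c)           # expected costs until first return to r
    return T, K, K[r] / T[r]            # renewal-reward ratio g(R)

# Illustrative three-state example (made-up data):
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.4]])
c = np.array([1.0, 3.0, 2.0])
T, K, g = first_passage_quantities(P, c, r=0)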
Next we define the particular relative value function

$$ w_i(R) = K_i(R) - g(R)\,T_i(R), \qquad i \in I. \qquad (6.3.1) $$
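Note that $w_r(R) = K_r(R) - g(R)T_r(R) = 0$ by the renewal-reward formula above. Continuing the numerical sketch (same illustrative names), the particular relative value function is simply

w = K - g * T               # w[i] = K_i(R) - g(R) * T_i(R), cf. (6.3.1)
assert abs(w[0]) < 1e-12    # w_r(R) = 0 at the regeneration state r = 0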