Page 251 - A First Course In Stochastic Models
P. 251

244             DISCRETE-TIME MARKOV DECISION PROCESSES

                Note, as a consequence of (6.3.1), the normalization

                                             w r (R) = 0.
                In accordance with the heuristic result (6.2.9), the next theorem shows that the
                average cost g = g(R) and the relative values υ i = w i (R), i ∈ I satisfy a system
                of linear equations.

                Theorem 6.3.1 Let R be a given stationary policy such that the associated Markov
                chain {X n } has no two disjoint closed sets. Then

                (a) The average cost g(R) and the relative values w i (R), i ∈ I, satisfy the following
                   system of linear equations in the unknowns g and υ i , i ∈ I:


                                  υ i = c i (R i ) − g +  p ij (R i )υ j ,  i ∈ I.  (6.3.2)
                                                  j∈I

                (b) Let the numbers g and υ i , i ∈ I, be any solution to (6.3.2). Then

                                               g = g(R)

                   and, for some constant c,
                                         υ i = w i (R) + c,  i ∈ I.


                (c) Let s be an arbitrarily chosen state. Then the linear equations (6.3.2) together
                   with the normalization equation υ s = 0 have a unique solution.

                Proof  (a) By conditioning on the next state following the initial state i, it can be
                seen that

                               T i (R) = 1 +  p ij (R i )T j (R),  i ∈ I,
                                          j =r

                               K i (R) = c i (R i ) +  p ij (R i )K j (R),  i ∈ I.
                                              j =r

                This implies that

                   K i (R) − g(R)T i (R) = c i (R i ) − g(R) +  p ij (R i ){K j (R) − g(R)T j (R)}.
                                                    j =r
                Hence, by w r (R) = 0, we find


                           w i (R) = c i (R i ) − g(R) +  p ij (R i )w j (R),  i ∈ I.
                                                 j∈I
   246   247   248   249   250   251   252   253   254   255   256