
                Average cost for a given stationary policy
Fix a stationary policy R. Under policy R, the action a = R_i is taken each time
the system is in state i at a decision epoch. The process {X_n} describing
the state of the system at the decision epochs is a Markov chain with one-step
transition probabilities p_{ij}(R_i), i, j ∈ I, when policy R is used. Denote the n-step
                transition probabilities of this Markov chain by
            p_{ij}^{(n)}(R) = P\{X_n = j | X_0 = i\},        i, j ∈ I and n = 1, 2, . . . .

Note that p_{ij}^{(1)}(R) = p_{ij}(R_i). By the equations (3.2.1),
            p_{ij}^{(n)}(R) = \sum_{k ∈ I} p_{ik}^{(n-1)}(R) p_{kj}(R_k),        n = 1, 2, . . . ,        (6.2.1)

where p_{ij}^{(0)}(R) = 1 for j = i and p_{ij}^{(0)}(R) = 0 for j ≠ i. Also, define the expected
cost function V_n(i, R) by
      V_n(i, R) = the total expected costs over the first n decision epochs
                  when the initial state is i and policy R is used.

                Obviously, we have
            V_n(i, R) = \sum_{t=0}^{n-1} \sum_{j ∈ I} p_{ij}^{(t)}(R) c_j(R_j).        (6.2.2)
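For concreteness, the recursion (6.2.1) and formula (6.2.2) can be evaluated numerically once
the one-step transition probabilities p_{ij}(R_i) and the one-step costs c_j(R_j) of the fixed
policy R are given. The following Python sketch is only an illustration; the matrix P, the cost
vector c and their numerical values are hypothetical and not part of the model description above.

    import numpy as np

    # Hypothetical data for a fixed stationary policy R on two states:
    # P[i, j] = p_ij(R_i) and c[j] = c_j(R_j).
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])
    c = np.array([1.0, 5.0])

    def total_expected_cost(P, c, n):
        """V_n(i, R) for every initial state i, via (6.2.1) and (6.2.2)."""
        Pt = np.eye(len(c))          # p_ij^(0)(R): identity matrix
        V = np.zeros(len(c))
        for _ in range(n):           # t = 0, 1, ..., n-1
            V += Pt @ c              # add sum_j p_ij^(t)(R) c_j(R_j)
            Pt = Pt @ P              # recursion (6.2.1): p^(t+1) = p^(t) P
        return V

For instance, total_expected_cost(P, c, 10) returns the vector of values V_10(i, R) for the two
initial states of this hypothetical example.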
Next we define the average cost function g_i(R) by

            g_i(R) = \lim_{n → ∞} \frac{1}{n} V_n(i, R),        i ∈ I.        (6.2.3)
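Continuing the hypothetical two-state sketch above, the ratio V_n(i, R)/n can be tabulated for
increasing n to see the averaging in (6.2.3) at work; for an irreducible chain it approaches
\sum_j π_j(R) c_j(R_j), with {π_j(R)} the equilibrium distribution of the chain (a standard fact
from Chapter 3, used here only as a numerical check).

    # Average cost per epoch, approximated by V_n(i, R)/n (hypothetical example).
    for n in (10, 100, 1000):
        print(n, total_expected_cost(P, c, n) / n)

    # For this irreducible chain the limit equals sum_j pi_j c_j(R_j), where pi
    # solves pi = pi P with sum_j pi_j = 1 (eigenvector of P^T for eigenvalue 1).
    eigenvalues, eigenvectors = np.linalg.eig(P.T)
    pi = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
    pi /= pi.sum()
    print(pi @ c)        # equals 1.8 for the numbers above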
The limit in (6.2.3) exists by Theorem 3.3.1 and represents the long-run average expected
cost per time unit when the system is controlled by policy R and the initial
state is i. A state i is said to be recurrent under policy R if the system ultimately
returns to state i with probability 1 when the system starts in state i
and policy R is used; see Section 3.2.3. Otherwise, state i is said to be transient
under policy R. If state i is recurrent under policy R, then g_i(R) allows for
the stronger interpretation

            the long-run actual average cost per time unit = g_i(R)        (6.2.4)

                with probability 1 when the initial state is i and policy R is used. This is a
                direct consequence of the theory for finite-state Markov chains. For the Markov
chain {X_n} corresponding to policy R, the state space can be uniquely split up
into a finite number of disjoint irreducible sets of recurrent states and a (possibly
empty) set of transient states; see Section 3.5.1. Denote the recurrent subclasses by
I_1(R), . . . , I_f(R) and the set of transient states by T(R). Since the system cannot