
THE SEMI-MARKOV DECISION MODEL

Define the random variable $Z(t)$ by
$$Z(t) = \text{the total costs incurred up to time } t, \qquad t \ge 0.$$

Fix now a stationary policy $R$. Denote by $E_{i,R}$ the expectation operator when the
initial state $X_0 = i$ and policy $R$ is used. Then the limit
$$g_i(R) = \lim_{t\to\infty} \frac{1}{t}\, E_{i,R}[Z(t)]$$
exists for all $i \in I$. This result can be proved by using the renewal-reward theorem
in Section 2.2. The details are omitted. Just as in the discrete-time model, we
can give a stronger interpretation to the average cost $g_i(R)$. If the initial state $i$
is recurrent under policy $R$, then the long-run actual average cost per time unit
equals $g_i(R)$ with probability 1. If the Markov chain $\{X_n\}$ associated with policy
$R$ has no two disjoint closed sets, the average cost $g_i(R)$ does not depend on the
initial state $X_0 = i$.
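
To make the limit concrete, here is a minimal Python sketch (not from the text) that estimates $g_i(R)$ from one long sample path of a hypothetical two-state semi-Markov process under a fixed stationary policy. The transition matrix `P`, the one-step costs `c`, the sojourn-time means `tau`, the exponential sojourn distribution and the lump-sum cost convention at each decision epoch are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical two-state model under a fixed stationary policy R
# (all numbers are illustrative assumptions, not from the text).
P = np.array([[0.3, 0.7],      # embedded transition matrix p_ij(R_i)
              [0.6, 0.4]])
c = np.array([4.0, 1.0])       # expected cost c_i(R_i) at a decision epoch in state i
tau = np.array([2.0, 5.0])     # mean sojourn time tau_i(R_i) in state i

def average_cost(i, horizon):
    """Estimate g_i(R) = lim (1/t) E[Z(t)] along one long sample path."""
    t, total_cost, state = 0.0, 0.0, i
    while t < horizon:
        total_cost += c[state]             # cost incurred at this decision epoch
        t += rng.exponential(tau[state])   # sojourn time until the next epoch
        state = rng.choice(2, p=P[state])  # next state of the embedded chain {X_n}
    return total_cost / t

# The two estimates nearly coincide: the embedded chain here has no two
# disjoint closed sets, so g_i(R) does not depend on the initial state.
print(average_cost(0, 1e6), average_cost(1, 1e6))
```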

Theorem 7.1.1 Suppose that the embedded Markov chain $\{X_n\}$ associated with
policy $R$ has no two disjoint closed sets. Then
$$\lim_{t\to\infty} \frac{Z(t)}{t} = g(R) \quad \text{with probability 1} \tag{7.1.1}$$
for each initial state $X_0 = i$, where the constant $g(R)$ is given by
$$g(R) = \sum_{j\in I} c_j(R_j)\,\pi_j(R) \Big/ \sum_{j\in I} \tau_j(R_j)\,\pi_j(R)$$
with $\{\pi_j(R)\}$ denoting the equilibrium distribution of the Markov chain $\{X_n\}$.
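
For the same illustrative two-state model as above, $g(R)$ can be computed exactly from the theorem's formula: solve $\pi P = \pi$, $\sum_j \pi_j = 1$ for the equilibrium distribution of the embedded chain, then form the cost-to-time ratio. A sketch, assuming the hypothetical numbers from the previous block:

```python
import numpy as np

# Same hypothetical two-state model as in the simulation sketch above.
P = np.array([[0.3, 0.7],
              [0.6, 0.4]])
c = np.array([4.0, 1.0])
tau = np.array([2.0, 5.0])

# Equilibrium distribution pi(R) of the embedded chain {X_n}: the left
# eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi /= pi.sum()

# g(R) = sum_j c_j(R_j) pi_j(R) / sum_j tau_j(R_j) pi_j(R)
g = (c @ pi) / (tau @ pi)
print(pi, g)   # pi = [6/13, 7/13], g = 31/47 ≈ 0.6596
```

The sample-path estimates from the earlier sketch should converge to this value, in line with (7.1.1).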
Proof  We give only a sketch of the proof of (7.1.1). The key to the proof of
(7.1.1) is that
$$\lim_{t\to\infty} \frac{Z(t)}{t} = \lim_{m\to\infty} \frac{E(\text{costs over the first } m \text{ decision epochs})}{E(\text{time over the first } m \text{ decision epochs})} \tag{7.1.2}$$
with probability 1. To verify this relation, fix a recurrent state $r$ and suppose that
$X_0 = r$. Let a cycle be defined as the time elapsed between two consecutive
transitions into state $r$. By the renewal-reward theorem in Section 2.2,
$$\lim_{t\to\infty} \frac{Z(t)}{t} = \frac{E(\text{costs incurred during one cycle})}{E(\text{length of one cycle})}$$
                with probability 1. By the expected-value version of the renewal-reward theorem,
$$\lim_{m\to\infty} \frac{1}{m}\, E(\text{costs over the first } m \text{ decision epochs}) = \frac{E(\text{costs incurred during one cycle})}{E(\text{number of transitions in one cycle})}$$
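
The cycle-based ratio above can be checked numerically on the same hypothetical model: simulate many regeneration cycles that start and end in a recurrent state $r$, and compare the ratio of mean cycle cost to mean cycle length with the exact $g(R)$ computed earlier. A sketch under the same illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Same hypothetical model; state r = 0 serves as the regeneration state.
P = np.array([[0.3, 0.7],
              [0.6, 0.4]])
c = np.array([4.0, 1.0])
tau = np.array([2.0, 5.0])
r = 0

costs, lengths = [], []
for _ in range(50_000):                  # simulate many cycles
    cycle_cost = cycle_length = 0.0
    state = r
    while True:
        cycle_cost += c[state]           # cost at this decision epoch
        cycle_length += rng.exponential(tau[state])
        state = rng.choice(2, p=P[state])
        if state == r:                   # a cycle ends on the return to state r
            break
    costs.append(cycle_cost)
    lengths.append(cycle_length)

# E(costs incurred during one cycle) / E(length of one cycle) ≈ g(R)
print(np.mean(costs) / np.mean(lengths))   # close to 31/47 ≈ 0.6596
```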