
leave a closed set, the process $\{X_n\}$ restricted to any recurrent subclass $I_\ell(R)$ is a Markov chain itself with its own equilibrium distribution. Since the restricted Markov chain on $I_\ell(R)$ is irreducible, it follows from Theorem 3.3.3 that (6.2.4) holds for $i \in I_\ell(R)$, $\ell = 1, \dots, f$. Moreover,

$$g_i(R) = g_j(R), \qquad i, j \in I_\ell(R).$$

Let $g^{(\ell)}(R)$ denote the common value of $g_i(R)$ for $i \in I_\ell(R)$. For a transient initial state $i$, the long-run average cost per time unit is a random variable. This random variable assumes the value $g^{(\ell)}(R)$ with probability $f_i^{(\ell)}(R)$, where $f_i^{(\ell)}(R)$ is the probability that the system is ultimately absorbed in the recurrent subclass $I_\ell(R)$ when the initial state is $i$ and policy $R$ is used. Obviously, $g_i(R) = \sum_{\ell=1}^{f} g^{(\ell)}(R)\, f_i^{(\ell)}(R)$ for $i \in T(R)$, where $T(R)$ denotes the set of transient states under policy $R$.
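To make this bookkeeping concrete, the following Python sketch works through a small multichain example; the five-state transition matrix, the cost vector and the class structure are all invented for illustration and do not come from the text. It computes $g^{(\ell)}(R)$ per recurrent subclass, obtains the absorption probabilities $f_i^{(\ell)}(R)$ from a linear system over the transient states, and mixes the two as in the formula above.

```python
import numpy as np

# Hypothetical 5-state chain under a fixed policy R: states {0,1} and
# {2,3} are disjoint recurrent subclasses, state 4 is transient.
# All numbers are illustrative.
P = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.8, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.2, 0.1, 0.3, 0.2, 0.2],
])
c = np.array([1.0, 2.0, 4.0, 3.0, 5.0])   # one-step costs c_j(R_j)
classes = [[0, 1], [2, 3]]                # recurrent subclasses I_1(R), I_2(R)
transient = [4]                           # T(R)

def class_average_cost(P, c, cls):
    """g^(l)(R): solve the equilibrium equations restricted to the
    recurrent subclass, then weight the one-step costs by pi."""
    Q = P[np.ix_(cls, cls)]
    n = len(cls)
    A = np.vstack([Q.T - np.eye(n), np.ones(n)])   # pi Q = pi, sum pi = 1
    b = np.append(np.zeros(n), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    return pi @ c[cls]

g = [class_average_cost(P, c, cls) for cls in classes]

# Absorption probabilities f_i^(l)(R) for transient i: solve the system
# f_i = sum_{j in T(R)} p_ij f_j + sum_{j in I_l(R)} p_ij.
T = P[np.ix_(transient, transient)]
fs = []
for cls in classes:
    b = P[np.ix_(transient, cls)].sum(axis=1)
    fs.append(np.linalg.solve(np.eye(len(transient)) - T, b))

# g_i(R) = sum_l g^(l)(R) f_i^(l)(R) for the transient state i = 4.
g4 = sum(gl * fl[0] for gl, fl in zip(g, fs))
print("g^(l):", [round(x, 4) for x in g])
print("f_4^(l):", [round(fl[0], 4) for fl in fs])
print("g_4(R) =", round(g4, 4))
```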
The above technical discussion involves rather heavy notation and might be intimidating for some readers. The discussion simplifies greatly when the Markov chain $\{X_n\}$ corresponding to policy $R$ is unichain, as is usually the case in practical situations. The Markov chain is said to be unichain if it has no two disjoint closed sets. In the unichain case the Markov chain $\{X_n\}$ has a unique equilibrium distribution $\{\pi_j(R),\; j \in I\}$. For any $j \in I$,

$$\lim_{m \to \infty} \frac{1}{m} \sum_{n=1}^{m} p_{ij}^{(n)}(R) = \pi_j(R), \tag{6.2.5}$$

independently of the initial state $i$. The $\pi_j(R)$ are the unique solution to the system of equilibrium equations

$$\pi_j(R) = \sum_{i \in I} p_{ij}(R_i)\, \pi_i(R), \qquad j \in I, \tag{6.2.6}$$

in conjunction with $\sum_{j \in I} \pi_j(R) = 1$. By (6.2.2), (6.2.3) and (6.2.5),

$$g_i(R) = g(R) \qquad \text{for all } i \in I$$

with

$$g(R) = \sum_{j \in I} c_j(R_j)\, \pi_j(R). \tag{6.2.7}$$
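In the unichain case, (6.2.6) together with the normalization is just a small linear system, and (6.2.7) is then an inner product. A minimal Python sketch, using an illustrative three-state transition matrix and cost vector rather than data from the text, with the Cesàro averages of (6.2.5) as a numerical check:

```python
import numpy as np

# Illustrative unichain example: P holds p_ij(R_i), c holds c_j(R_j).
# Placeholder numbers, not data from the text.
P = np.array([[0.4, 0.6, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.0, 0.9]])
c = np.array([2.0, 1.0, 3.0])
n = len(c)

# Equilibrium equations (6.2.6) stacked with the normalization
# sum_j pi_j(R) = 1, solved by least squares.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi = np.linalg.lstsq(A, b, rcond=None)[0]
g = pi @ c                                 # g(R) from (6.2.7)

# Numerical check of (6.2.5): the Cesaro averages (1/m) sum_n P^n tend
# to a matrix whose rows all equal pi, whatever the initial state i.
m, Pn, cesaro = 5000, np.eye(n), np.zeros((n, n))
for _ in range(m):
    Pn = Pn @ P
    cesaro += Pn / m
print("pi(R) =", pi, "  g(R) =", round(g, 4))
print("max deviation of Cesaro rows from pi:", np.abs(cesaro - pi).max())
```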
We defined $g_i(R)$ as an average expected cost. For the unichain case, it follows from renewal-reward theory that the long-run average actual cost per time unit equals $g(R)$ with probability 1 when policy $R$ is used, independently of the initial state.
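A brief simulation (again with made-up numbers) illustrates this renewal-reward statement: the actual average cost along a single sample path settles to the same value $g(R)$ whatever the starting state.

```python
import numpy as np

# Same illustrative unichain data as above.
P = np.array([[0.4, 0.6, 0.0],
              [0.2, 0.5, 0.3],
              [0.1, 0.0, 0.9]])
c = np.array([2.0, 1.0, 3.0])

rng = np.random.default_rng(1)
N = 100_000
for start in range(3):
    state, total = start, 0.0
    for _ in range(N):
        total += c[state]                  # actual cost incurred this step
        state = rng.choice(3, p=P[state])  # next state under policy R
    # Each run prints (approximately) the same number g(R).
    print(f"start {start}: long-run average cost {total / N:.4f}")
```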
In practical applications the Markov chain $\{X_n\}$ associated with an optimal stationary policy will typically be unichain. The reader might wonder why we are still paying attention to the multichain case. The reason is that in some applications non-optimal policies may have multiple recurrent subclasses and those policies may