leave a closed set, the process $\{X_n\}$ restricted to any recurrent subclass $I_\ell(R)$ is
a Markov chain itself with its own equilibrium distribution. Since the restricted
Markov chain on $I_\ell(R)$ is irreducible, it follows from Theorem 3.3.3 that (6.2.4)
holds for $i \in I_\ell(R)$, $\ell = 1, \dots, f$. Moreover,
$$g_i(R) = g_j(R), \qquad i, j \in I_\ell(R).$$
Let $g^{(\ell)}(R)$ denote the common value of $g_i(R)$ for $i \in I_\ell(R)$. For a transient initial
state $i$, the long-run average cost per time unit is a random variable. This random
variable assumes the value $g^{(\ell)}(R)$ with probability $f_i^{(\ell)}(R)$, where $f_i^{(\ell)}(R)$
is the probability that the system is ultimately absorbed in the recurrent subclass
$I_\ell(R)$ when the initial state is $i$ and policy $R$ is used. Obviously,
$$g_i(R) = \sum_{\ell=1}^{f} g^{(\ell)}(R)\, f_i^{(\ell)}(R) \qquad \text{for } i \in T(R).$$
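To make this bookkeeping concrete, here is a minimal numerical sketch (an illustration of the formula above, not an example from the text), assuming a hypothetical five-state chain under a fixed policy $R$ with two recurrent subclasses and a single transient state; $T(R)$ denotes the set of states that are transient under $R$. The subclass averages $g^{(\ell)}(R)$ come from the equilibrium distributions of the restricted chains, and the absorption probabilities $f_i^{(\ell)}(R)$ are obtained by solving a standard linear system over the transient states.

```python
import numpy as np

# Hypothetical 5-state chain under a fixed policy R (illustrative numbers only):
# states 0,1 form recurrent subclass I_1, states 2,3 form I_2, state 4 is transient.
P = np.array([
    [0.6, 0.4, 0.0, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.8, 0.0],
    [0.0, 0.0, 0.5, 0.5, 0.0],
    [0.3, 0.0, 0.2, 0.1, 0.4],   # the transient state can reach both subclasses
])
c = np.array([2.0, 1.0, 4.0, 3.0, 5.0])    # one-step costs c_j(R_j)

classes = [[0, 1], [2, 3]]                 # recurrent subclasses I_1, I_2
transient = [4]                            # T(R)

# g^(l)(R): average cost of the chain restricted to each recurrent subclass
g_class = []
for cls in classes:
    P_cls = P[np.ix_(cls, cls)]
    # equilibrium distribution of the restricted (irreducible) chain
    A = np.vstack([P_cls.T - np.eye(len(cls)), np.ones(len(cls))])
    b = np.append(np.zeros(len(cls)), 1.0)
    pi_cls, *_ = np.linalg.lstsq(A, b, rcond=None)
    g_class.append(pi_cls @ c[cls])

# f_i^(l)(R): absorption probabilities from the transient states.
# With Q = P restricted to T(R) and r_l the one-step probability of entering I_l,
# the absorption probabilities solve (I - Q) f^(l) = r_l.
Q = P[np.ix_(transient, transient)]
g_transient = np.zeros(len(transient))
for ell, cls in enumerate(classes):
    r = P[np.ix_(transient, cls)].sum(axis=1)
    f_ell = np.linalg.solve(np.eye(len(transient)) - Q, r)
    g_transient += g_class[ell] * f_ell    # g_i(R) = sum_l g^(l)(R) f_i^(l)(R)

print("g^(l)(R) per recurrent subclass:", g_class)
print("g_i(R) for transient states:", g_transient)
```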
The above technical discussion involves rather heavy notation and might be
intimidating for some readers. The discussion simplifies considerably when the Markov
chain $\{X_n\}$ corresponding to policy $R$ is unichain, as is usually the case in practical
situations. The Markov chain is said to be unichain if it has no two disjoint
closed sets. In the unichain case the Markov chain $\{X_n\}$ has a unique equilibrium
distribution $\{\pi_j(R),\ j \in I\}$. For any $j \in I$,
$$\lim_{m \to \infty} \frac{1}{m} \sum_{n=1}^{m} p_{ij}^{(n)}(R) = \pi_j(R), \qquad (6.2.5)$$
independently of the initial state $i$. The $\pi_j(R)$ are the unique solution to the system
of equilibrium equations
$$\pi_j(R) = \sum_{i \in I} p_{ij}(R_i)\, \pi_i(R), \qquad j \in I, \qquad (6.2.6)$$
in conjunction with $\sum_{j \in I} \pi_j(R) = 1$. By (6.2.2), (6.2.3) and (6.2.5),
$$g_i(R) = g(R) \qquad \text{for all } i \in I$$
with
$$g(R) = \sum_{j \in I} c_j(R_j)\, \pi_j(R). \qquad (6.2.7)$$
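In the unichain case, (6.2.6) and (6.2.7) translate directly into a small linear-algebra computation. The sketch below uses an assumed three-state transition matrix and cost vector (illustrative numbers, not taken from the text); it solves the equilibrium equations together with the normalization, evaluates $g(R)$, and numerically checks the Cesàro-limit property (6.2.5).

```python
import numpy as np

# Assumed unichain transition matrix P(R) and one-step costs c_j(R_j)
# under a fixed stationary policy R (illustrative numbers only).
P = np.array([
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
])
c = np.array([3.0, 1.0, 2.0])

n = P.shape[0]
# Equilibrium equations (6.2.6), pi = pi P, together with sum_j pi_j = 1.
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

# Average cost (6.2.7): g(R) = sum_j c_j(R_j) pi_j(R).
g = pi @ c
print("pi(R) =", pi, " g(R) =", round(float(g), 4))

# Numerical check of (6.2.5): the Cesaro averages of the n-step
# transition probabilities converge to pi_j(R) for every starting state.
m = 2000
Pn = np.eye(n)
cesaro = np.zeros((n, n))
for _ in range(m):
    Pn = Pn @ P
    cesaro += Pn
cesaro /= m
print("max deviation from pi:", np.abs(cesaro - pi).max())
```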
We defined $g_i(R)$ as an average expected cost. For the unichain case, it follows
from renewal-reward theory that the long-run average actual cost per time unit
equals g(R) with probability 1 when policy R is used, independently of the initial
state.
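This probability-1 statement can be illustrated by simulating a single long sample path of the same kind of assumed unichain example: the realized average cost should settle near $g(R)$ no matter where the chain starts. A minimal sketch:

```python
import numpy as np

# Same assumed unichain example: the realized long-run average cost along
# one sample path should be close to g(R), whatever the starting state.
rng = np.random.default_rng(0)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.3, 0.5]])
c = np.array([3.0, 1.0, 2.0])

for start in range(3):
    state, total = start, 0.0
    steps = 200_000
    for _ in range(steps):
        total += c[state]
        state = rng.choice(3, p=P[state])
    print(f"start={start}: realized average cost = {total / steps:.4f}")
```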
In practical applications the Markov chain $\{X_n\}$ associated with an optimal stationary
policy will typically be unichain. The reader might wonder why we are
still paying attention to the multichain case. The reason is that in some applications
non-optimal policies may have multiple recurrent subclasses and those policies may