
                Average cost for a given stationary policy
Fix a stationary policy R. Under policy R, the action a = R_i is taken each time
the system is in state i at a decision epoch. The process {X_n} describing
the state of the system at the decision epochs is a Markov chain with one-step
transition probabilities p_{ij}(R_i), i, j ∈ I, when policy R is used. Denote the n-step
                transition probabilities of this Markov chain by
            p_{ij}^{(n)}(R) = P\{X_n = j | X_0 = i\},        i, j ∈ I and n = 1, 2, . . . .

Note that p_{ij}^{(1)}(R) = p_{ij}(R_i). By the equations (3.2.1),
            p_{ij}^{(n)}(R) = \sum_{k ∈ I} p_{ik}^{(n-1)}(R) p_{kj}(R_k),        n = 1, 2, . . . ,        (6.2.1)

where p_{ij}^{(0)}(R) = 1 for j = i and p_{ij}^{(0)}(R) = 0 for j ≠ i. Also, define the expected
cost function V_n(i, R) by
      V_n(i, R) = the total expected costs over the first n decision epochs
                  when the initial state is i and policy R is used.

                Obviously, we have
            V_n(i, R) = \sum_{t=0}^{n-1} \sum_{j ∈ I} p_{ij}^{(t)}(R) c_j(R_j).        (6.2.2)
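For concreteness, the recursion (6.2.1) and formula (6.2.2) can be evaluated numerically once
the one-step transition probabilities p_{ij}(R_i) and the one-step costs c_j(R_j) of the fixed
policy R are given. The following Python sketch is only an illustration; the matrix P, the cost
vector c and their numerical values are hypothetical and not part of the model description above.

    import numpy as np

    # Hypothetical data for a fixed stationary policy R on two states:
    # P[i, j] = p_ij(R_i) and c[j] = c_j(R_j).
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])
    c = np.array([1.0, 5.0])

    def total_expected_cost(P, c, n):
        """V_n(i, R) for every initial state i, via (6.2.1) and (6.2.2)."""
        Pt = np.eye(len(c))          # p_ij^(0)(R): identity matrix
        V = np.zeros(len(c))
        for _ in range(n):           # t = 0, 1, ..., n-1
            V += Pt @ c              # add sum_j p_ij^(t)(R) c_j(R_j)
            Pt = Pt @ P              # recursion (6.2.1): p^(t+1) = p^(t) P
        return V

For instance, total_expected_cost(P, c, 10) returns the vector of values V_10(i, R) for the two
initial states of this hypothetical example.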
Next we define the average cost function g_i(R) by

            g_i(R) = \lim_{n → ∞} \frac{1}{n} V_n(i, R),        i ∈ I.        (6.2.3)
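Continuing the hypothetical two-state sketch above, the ratio V_n(i, R)/n can be tabulated for
increasing n to see the averaging in (6.2.3) at work; for an irreducible chain it approaches
\sum_j π_j(R) c_j(R_j), with {π_j(R)} the equilibrium distribution of the chain (a standard fact
from Chapter 3, used here only as a numerical check).

    # Average cost per epoch, approximated by V_n(i, R)/n (hypothetical example).
    for n in (10, 100, 1000):
        print(n, total_expected_cost(P, c, n) / n)

    # For this irreducible chain the limit equals sum_j pi_j c_j(R_j), where pi
    # solves pi = pi P with sum_j pi_j = 1 (eigenvector of P^T for eigenvalue 1).
    eigenvalues, eigenvectors = np.linalg.eig(P.T)
    pi = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
    pi /= pi.sum()
    print(pi @ c)        # equals 1.8 for the numbers above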
The limit in (6.2.3) exists by Theorem 3.3.1 and represents the long-run average expected
cost per time unit when the system is controlled by policy R and the initial
state is i. A state i is said to be recurrent under policy R if the system ultimately
returns to state i with probability 1 when the system starts in state i
and policy R is used; see Section 3.2.3. Otherwise, state i is said to be transient
under policy R. If state i is recurrent under policy R, then g_i(R) allows for
the stronger interpretation

            the long-run actual average cost per time unit = g_i(R)        (6.2.4)

                with probability 1 when the initial state is i and policy R is used. This is a
                direct consequence of the theory for finite-state Markov chains. For the Markov
chain {X_n} corresponding to policy R, the state space can be uniquely split up
into a finite number of disjoint irreducible sets of recurrent states and a (possibly
empty) set of transient states; see Section 3.5.1. Denote the recurrent subclasses by
I_1(R), . . . , I_f(R) and the set of transient states by T(R). Since the system cannot