
where the strict inequality sign holds when the strict inequality sign holds in (6.2.11) for some i with $\pi_i(\overline{R}) > 0$. Interchanging the order of summation in the above inequality and using (6.2.6) and (6.2.7) with R replaced by $\overline{R}$, we find

$$ g(\overline{R}) - g + \sum_{j \in I} \pi_j(\overline{R})\,\upsilon_j \;\le\; \sum_{i \in I} \pi_i(\overline{R})\,\upsilon_i, $$

where the strict inequality sign holds when the strict inequality sign holds in (6.2.11) for some i with $\pi_i(\overline{R}) > 0$. This verifies the theorem for the case of a unichain policy $\overline{R}$. Next it is easy to establish the theorem for the case of a multichain policy $\overline{R}$. Letting $I_1(\overline{R}), \ldots, I_f(\overline{R})$ denote the recurrent subclasses of the Markov chain associated with policy $\overline{R}$, the above proof shows that for any $\ell = 1, \ldots, f$ the inequality (6.2.12) holds for all $i \in I_\ell(\overline{R})$. The proof of the theorem is next completed by noting that for each transient state i the average expected cost $g_i(\overline{R})$ is a linear combination of the average costs on the recurrent subclasses.
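To make the interchange step explicit, the following sketch assumes that (6.2.11) reads $c_i(\overline{R}_i) - g + \sum_{j \in I} p_{ij}(\overline{R}_i)\,\upsilon_j \le \upsilon_i$, that (6.2.6) states $g(\overline{R}) = \sum_{i \in I} \pi_i(\overline{R})\,c_i(\overline{R}_i)$ and that (6.2.7) gives the equilibrium equations $\pi_j(\overline{R}) = \sum_{i \in I} \pi_i(\overline{R})\,p_{ij}(\overline{R}_i)$; the exact statements appear earlier in Section 6.2. Multiplying (6.2.11) by $\pi_i(\overline{R})$, summing over i and using $\sum_{i \in I} \pi_i(\overline{R}) = 1$ gives

$$ \sum_{i \in I} \pi_i(\overline{R})\,c_i(\overline{R}_i) - g + \sum_{j \in I} \Bigl( \sum_{i \in I} \pi_i(\overline{R})\,p_{ij}(\overline{R}_i) \Bigr) \upsilon_j \;\le\; \sum_{i \in I} \pi_i(\overline{R})\,\upsilon_i, $$

and the assumed forms of (6.2.6) and (6.2.7) reduce the left-hand side to $g(\overline{R}) - g + \sum_{j \in I} \pi_j(\overline{R})\,\upsilon_j$.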


                            6.3  THE RELATIVE VALUE FUNCTION
                In Section 6.2 we introduced in a heuristic way the relative values for a given
                stationary policy R. In this section we give a rigorous treatment. This will be done
                for the case of a unichain policy R. Let r be any recurrent state of policy R. In
view of the unichain assumption, the Markov chain $\{X_n\}$ associated with policy R
                will visit state r after finitely many transitions, regardless of the initial state. Thus
                we can define, for each state i ∈ I,
       $T_i(R)$ = the expected time until the first return to state r when
               starting in state i and using policy R.

                In particular, letting a cycle be the time elapsed between two consecutive visits to
the regeneration state r under policy R, we have that $T_r(R)$ is the expected length
                of a cycle. Also define, for each i ∈ I,
      $K_i(R)$ = the expected costs incurred until the first return to state r
              when starting in state i and using policy R.

We use the convention that $K_i(R)$ includes the cost incurred when starting in state i but excludes the cost incurred when returning to state r. By the theory of renewal-reward processes, the average cost per time unit equals the expected costs incurred in one cycle divided by the expected length of one cycle, and so

$$ g(R) = \frac{K_r(R)}{T_r(R)}. $$
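For a finite state space, $T_i(R)$ and $K_i(R)$ satisfy the first-passage equations $T_i(R) = 1 + \sum_{j \neq r} p_{ij}(R_i)\,T_j(R)$ and $K_i(R) = c_i(R_i) + \sum_{j \neq r} p_{ij}(R_i)\,K_j(R)$, so both can be obtained by solving two linear systems with the same coefficient matrix. The sketch below illustrates this numerically; the function name and the three-state data are illustrative, not from the text, and P and c are assumed to hold the one-step transition probabilities and costs under the fixed policy R.

import numpy as np

def first_passage_quantities(P, c, r):
    """Solve the first-passage equations for a fixed unichain policy.

    P[i, j] : one-step transition probability p_ij(R_i) under policy R
    c[i]    : one-step cost c_i(R_i) under policy R
    r       : the chosen regeneration state

    Returns (T, K, g) with T[i] = T_i(R), K[i] = K_i(R) and
    g = K[r] / T[r], the average cost per time unit.
    """
    n = P.shape[0]
    # Cut off transitions into state r, so that (I - Q) x = b encodes
    # the first-passage equations x_i = b_i + sum over j != r of p_ij x_j.
    Q = P.copy()
    Q[:, r] = 0.0
    A = np.eye(n) - Q
    T = np.linalg.solve(A, np.ones(n))  # expected time until first return to r
    K = np.linalg.solve(A, c)           # expected costs until first return to r
    return T, K, K[r] / T[r]            # renewal-reward ratio g(R)

# Illustrative three-state example (made-up data):
P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.3, 0.3, 0.4]])
c = np.array([1.0, 3.0, 2.0])
T, K, g = first_passage_quantities(P, c, r=0)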
Next we define the particular relative value function

$$ w_i(R) = K_i(R) - g(R)\,T_i(R), \qquad i \in I. \qquad (6.3.1) $$
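Note that $w_r(R) = K_r(R) - g(R)T_r(R) = 0$ by the renewal-reward formula above. Continuing the numerical sketch (same illustrative names), the particular relative value function is simply

w = K - g * T               # w[i] = K_i(R) - g(R) * T_i(R), cf. (6.3.1)
assert abs(w[0]) < 1e-12    # w_r(R) = 0 at the regeneration state r = 0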