Page 250 - A First Course In Stochastic Models
P. 250
THE RELATIVE VALUE FUNCTION 243
where the strict inequality sign holds when the strict inequality sign holds in (6.2.11)
for some i with π i (R) > 0. Interchanging the order of summation in the above
inequality and using (6.2.6) and (6.2.7) with R replaced by R, we find
g(R) − g + π j (R)υ j ≤ π i (R)υ i ,
j∈I i∈I
where the strict inequality sign holds when the strict inequality sign holds in (6.2.11)
for some i with π i (R) > 0. This verifies the theorem for the case of a unichain
policy R. Next it is easy to establish the theorem for the case of a multichain
policy R. Letting I 1 (R), . . . , I f (R) denote the recurrent subclasses of the Markov
chain associated with policy R, the above proof shows that for any ℓ = 1, . . . , f
the inequality (6.2.12) holds for all i ∈ I ℓ (R). The proof of the theorem is next
completed by noting that for each transient state i the average expected cost g i (R)
is a linear combination of the average costs on the recurrent subclasses.
6.3 THE RELATIVE VALUE FUNCTION
In Section 6.2 we introduced in a heuristic way the relative values for a given
stationary policy R. In this section we give a rigorous treatment. This will be done
for the case of a unichain policy R. Let r be any recurrent state of policy R. In
view of the unichain assumption, the Markov chain {X n } associated with policy R
will visit state r after finitely many transitions, regardless of the initial state. Thus
we can define, for each state i ∈ I,
T i (R) = the expected time until the first return to state r when
starting in state i and using policy R.
In particular, letting a cycle be the time elapsed between two consecutive visits to
the regeneration state r under policy R, we have that T r (R) is the expected length
of a cycle. Also define, for each i ∈ I,
K i (R) = the expected costs incurred until the first return to state r
when starting in state i and using policy R.
We use the convention that K i (R) includes the cost incurred when starting in state
i but excludes the cost incurred when returning to state r. By the theory of renewal-
reward processes, the average cost per time unit equals the expected costs incurred
in one cycle divided by the expected length of one cycle and so
K r (R)
g(R) = .
T r (R)
Next we define the particular relative value function
w i (R) = K i (R) − g(R)T i (R), i ∈ I. (6.3.1)