with $V_0(i, R) = 0$. This equation follows by conditioning on the next state that occurs when action $a = R_i$ is made in state $i$ with $n$ decision epochs to go. A cost $c_i(R_i)$ is incurred at the first decision epoch and the total expected cost over the remaining $n-1$ decision epochs is $V_{n-1}(j, R)$ when the next state is $j$. By substituting the asymptotic expansion (6.2.8) in the recursion equation, we find, after cancelling out common terms,
$$g(R) + \upsilon_i(R) \approx c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\,\upsilon_j(R), \qquad i \in I. \tag{6.2.9}$$
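Read as a system of equalities, relation (6.2.9) together with one normalization determines $g(R)$ and the relative values $\upsilon_i(R)$ for a given policy $R$. The sketch below solves this linear system with NumPy; the two-state cost and transition data are invented for illustration, and fixing one $\upsilon$-value at zero is only a device to remove the free additive constant.

```python
import numpy as np

# Minimal sketch of the value-determination step behind (6.2.9), read as a
# system of equalities: g(R) + v[i] = c[i] + sum_j P[i, j] * v[j] for i in I.
# The two-state costs and transition matrix are invented for illustration;
# v[ref_state] is pinned at 0 to remove the free additive constant in v.

def evaluate_policy(c, P, ref_state=0):
    """Solve g + v[i] = c[i] + sum_j P[i, j] v[j] with v[ref_state] = 0."""
    n = len(c)
    A = np.zeros((n + 1, n + 1))     # unknowns: v[0], ..., v[n-1], g
    b = np.zeros(n + 1)
    for i in range(n):
        A[i, :n] = -P[i]             # -sum_j p_ij(R_i) v_j(R)
        A[i, i] += 1.0               # +v_i(R)
        A[i, n] = 1.0                # +g(R)
        b[i] = c[i]                  # = c_i(R_i)
    A[n, ref_state] = 1.0            # normalization v[ref_state] = 0
    x = np.linalg.solve(A, b)
    return x[n], x[:n]               # g(R) and the relative values v(R)

c = np.array([1.0, 3.0])                  # c_i(R_i), assumed numbers
P = np.array([[0.6, 0.4], [0.2, 0.8]])    # p_ij(R_i), assumed numbers
g, v = evaluate_policy(c, P)
print(g, v)                               # 2.333..., [0.0, 3.333...]
```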
The intuitive idea behind the procedure for improving the given policy $R$ is to consider the following difference in costs:

$\Delta(i, a, R)$ = the difference in total expected costs over an infinitely long period of time by first taking action $a$ and next using policy $R$, rather than using policy $R$ from scratch, when the initial state is $i$.
This difference is equal to zero when action $a = R_i$ is chosen. We wish to make the difference $\Delta(i, a, R)$ as negative as possible. This difference is given by
$$\Delta(i, a, R) = \lim_{n \to \infty} \left[ \left\{ c_i(a) + \sum_{j \in I} p_{ij}(a)\,V_{n-1}(j, R) \right\} - \left\{ c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\,V_{n-1}(j, R) \right\} \right].$$
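As a numerical check of this limit (the two-state data below are the same kind of invented numbers as in the earlier sketch, and the alternative action's cost and transition row are likewise assumptions), one can iterate the recursion for $V_n(\cdot, R)$ and watch the bracketed difference settle down as $n$ grows:

```python
import numpy as np

# Iterate V_n(i, R) = c_i(R_i) + sum_j p_ij(R_i) V_{n-1}(j, R) with V_0 = 0 and
# track the bracketed difference for one fixed pair (i, a); for a unichain,
# aperiodic policy it converges to Delta(i, a, R). All numbers are assumed.

c_R = np.array([1.0, 3.0])                    # c_i(R_i) under policy R
P_R = np.array([[0.6, 0.4], [0.2, 0.8]])      # p_ij(R_i) under policy R
i, c_a, p_a = 0, 2.0, np.array([0.1, 0.9])    # alternative action a in state i

V = np.zeros(2)                               # V_0(., R) = 0
for n in range(1, 101):
    diff = (c_a + p_a @ V) - (c_R[i] + P_R[i] @ V)   # expression between brackets
    V = c_R + P_R @ V                                # V_n from V_{n-1}
print(diff)                                   # approximately Delta(i, a, R)
```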
Substituting (6.2.8) into the expression between brackets, and noting that $\sum_{j \in I} p_{ij}(a) = 1$ so that the term $(n-1)g(R)$ comes outside the summation, we find that for large $n$ this expression is approximately equal to
$$\left\{ c_i(a) + \sum_{j \in I} p_{ij}(a)\,\upsilon_j(R) + (n-1)g(R) \right\} - \left\{ c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\,\upsilon_j(R) + (n-1)g(R) \right\}.$$
This gives
$$\Delta(i, a, R) \approx c_i(a) + \sum_{j \in I} p_{ij}(a)\,\upsilon_j(R) - c_i(R_i) - \sum_{j \in I} p_{ij}(R_i)\,\upsilon_j(R).$$
Thus, by using (6.2.9),
$$\Delta(i, a, R) \approx c_i(a) + \sum_{j \in I} p_{ij}(a)\,\upsilon_j(R) - g(R) - \upsilon_i(R).$$
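This last approximation is the quantity used to test an action: any $a$ with $\Delta(i, a, R) < 0$ improves on $R_i$ in state $i$, and choosing in each state the action making it most negative gives the improved policy. A minimal sketch follows, where the two-state costs and transition rows are invented and $g$, $\upsilon$ are the values obtained from evaluating the current policy through (6.2.9):

```python
import numpy as np

# Policy-improvement test based on
#   Delta(i, a, R) ~ c_i(a) + sum_j p_ij(a) v_j(R) - g(R) - v_i(R).
# Costs, transition rows, and the current policy R are assumed data; g and v
# are the values found by evaluating R through (6.2.9) (see the earlier sketch).

costs = {0: {0: 1.0, 1: 2.0},                 # c_i(a) for states i = 0, 1
         1: {0: 3.0, 1: 2.5}}
trans = {0: {0: np.array([0.6, 0.4]), 1: np.array([0.1, 0.9])},   # p_ij(a)
         1: {0: np.array([0.2, 0.8]), 1: np.array([0.5, 0.5])}}
R = {0: 0, 1: 0}                              # current policy: R_i = 0 in both states
g, v = 7.0 / 3.0, np.array([0.0, 10.0 / 3.0]) # g(R), v(R) for this R

new_R = {}
for i in costs:
    delta = {a: costs[i][a] + trans[i][a] @ v - g - v[i] for a in costs[i]}
    new_R[i] = min(delta, key=delta.get)      # Delta(i, R_i, R) = 0, so never worse
    print(i, delta)
print("improved policy:", new_R)              # picks action 1 in state 1
```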