with $V_0(i, R) = 0$. This equation follows by conditioning on the next state that occurs when action $a = R_i$ is made in state $i$ with $n$ decision epochs to go. A cost $c_i(R_i)$ is incurred at the first decision epoch and the total expected cost over the remaining $n-1$ decision epochs is $V_{n-1}(j, R)$ when the next state is $j$. By substituting the asymptotic expansion (6.2.8) in the recursion equation, we find, after cancelling out common terms,
$$g(R) + \upsilon_i(R) \approx c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\,\upsilon_j(R), \qquad i \in I. \tag{6.2.9}$$
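Read as a system of equalities, relation (6.2.9) together with one normalization determines $g(R)$ and the relative values $\upsilon_i(R)$ for a given policy $R$. The sketch below solves this linear system with NumPy; the two-state cost and transition data are invented for illustration, and fixing one $\upsilon$-value at zero is only a device to remove the free additive constant.

```python
import numpy as np

# Minimal sketch of the value-determination step behind (6.2.9), read as a
# system of equalities: g(R) + v[i] = c[i] + sum_j P[i, j] * v[j] for i in I.
# The two-state costs and transition matrix are invented for illustration;
# v[ref_state] is pinned at 0 to remove the free additive constant in v.

def evaluate_policy(c, P, ref_state=0):
    """Solve g + v[i] = c[i] + sum_j P[i, j] v[j] with v[ref_state] = 0."""
    n = len(c)
    A = np.zeros((n + 1, n + 1))     # unknowns: v[0], ..., v[n-1], g
    b = np.zeros(n + 1)
    for i in range(n):
        A[i, :n] = -P[i]             # -sum_j p_ij(R_i) v_j(R)
        A[i, i] += 1.0               # +v_i(R)
        A[i, n] = 1.0                # +g(R)
        b[i] = c[i]                  # = c_i(R_i)
    A[n, ref_state] = 1.0            # normalization v[ref_state] = 0
    x = np.linalg.solve(A, b)
    return x[n], x[:n]               # g(R) and the relative values v(R)

c = np.array([1.0, 3.0])                  # c_i(R_i), assumed numbers
P = np.array([[0.6, 0.4], [0.2, 0.8]])    # p_ij(R_i), assumed numbers
g, v = evaluate_policy(c, P)
print(g, v)                               # 2.333..., [0.0, 3.333...]
```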
The intuitive idea behind the procedure for improving the given policy $R$ is to consider the following difference in costs:

$\Delta(i, a, R)$ = the difference in total expected costs over an infinitely long period of time by first taking action $a$ and next using policy $R$, rather than using policy $R$ from scratch, when the initial state is $i$.
This difference is equal to zero when action $a = R_i$ is chosen. We wish to make the difference $\Delta(i, a, R)$ as negative as possible. This difference is given by
$$\Delta(i, a, R) = \lim_{n \to \infty} \left[ \left\{ c_i(a) + \sum_{j \in I} p_{ij}(a)\,V_{n-1}(j, R) \right\} - \left\{ c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\,V_{n-1}(j, R) \right\} \right].$$
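As a numerical check of this limit (the two-state data below are the same kind of invented numbers as in the earlier sketch, and the alternative action's cost and transition row are likewise assumptions), one can iterate the recursion for $V_n(\cdot, R)$ and watch the bracketed difference settle down as $n$ grows:

```python
import numpy as np

# Iterate V_n(i, R) = c_i(R_i) + sum_j p_ij(R_i) V_{n-1}(j, R) with V_0 = 0 and
# track the bracketed difference for one fixed pair (i, a); for a unichain,
# aperiodic policy it converges to Delta(i, a, R). All numbers are assumed.

c_R = np.array([1.0, 3.0])                    # c_i(R_i) under policy R
P_R = np.array([[0.6, 0.4], [0.2, 0.8]])      # p_ij(R_i) under policy R
i, c_a, p_a = 0, 2.0, np.array([0.1, 0.9])    # alternative action a in state i

V = np.zeros(2)                               # V_0(., R) = 0
for n in range(1, 101):
    diff = (c_a + p_a @ V) - (c_R[i] + P_R[i] @ V)   # expression between brackets
    V = c_R + P_R @ V                                # V_n from V_{n-1}
print(diff)                                   # approximately Delta(i, a, R)
```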
Substituting (6.2.8) into the expression between brackets, and noting that $\sum_{j \in I} p_{ij}(a) = 1$ so that the term $(n-1)g(R)$ comes outside the summation, we find that for large $n$ this expression is approximately equal to
$$\left\{ c_i(a) + \sum_{j \in I} p_{ij}(a)\,\upsilon_j(R) + (n-1)g(R) \right\} - \left\{ c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\,\upsilon_j(R) + (n-1)g(R) \right\}.$$
This gives
$$\Delta(i, a, R) \approx c_i(a) + \sum_{j \in I} p_{ij}(a)\,\upsilon_j(R) - c_i(R_i) - \sum_{j \in I} p_{ij}(R_i)\,\upsilon_j(R).$$
Thus, by using (6.2.9),
$$\Delta(i, a, R) \approx c_i(a) + \sum_{j \in I} p_{ij}(a)\,\upsilon_j(R) - g(R) - \upsilon_i(R).$$
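This last approximation is the quantity used to test an action: any $a$ with $\Delta(i, a, R) < 0$ improves on $R_i$ in state $i$, and choosing in each state the action making it most negative gives the improved policy. A minimal sketch follows, where the two-state costs and transition rows are invented and $g$, $\upsilon$ are the values obtained from evaluating the current policy through (6.2.9):

```python
import numpy as np

# Policy-improvement test based on
#   Delta(i, a, R) ~ c_i(a) + sum_j p_ij(a) v_j(R) - g(R) - v_i(R).
# Costs, transition rows, and the current policy R are assumed data; g and v
# are the values found by evaluating R through (6.2.9) (see the earlier sketch).

costs = {0: {0: 1.0, 1: 2.0},                 # c_i(a) for states i = 0, 1
         1: {0: 3.0, 1: 2.5}}
trans = {0: {0: np.array([0.6, 0.4]), 1: np.array([0.1, 0.9])},   # p_ij(a)
         1: {0: np.array([0.2, 0.8]), 1: np.array([0.5, 0.5])}}
R = {0: 0, 1: 0}                              # current policy: R_i = 0 in both states
g, v = 7.0 / 3.0, np.array([0.0, 10.0 / 3.0]) # g(R), v(R) for this R

new_R = {}
for i in costs:
    delta = {a: costs[i][a] + trans[i][a] @ v - g - v[i] for a in costs[i]}
    new_R[i] = min(delta, key=delta.get)      # Delta(i, R_i, R) = 0, so never worse
    print(i, delta)
print("improved policy:", new_R)              # picks action 1 in state 1
```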