Page 248 - A First Course In Stochastic Models
P. 248

THE POLICY-IMPROVEMENT IDEA                  241

                with V 0 (i, R) = 0. This equation follows by conditioning on the next state that
                occurs when action a = R i is made in state i when n decision epochs are to go. A
                cost c i (R i ) is incurred at the first decision epoch and the total expected cost over
                the remaining n − 1 decision epochs is V n−1 (j, R) when the next state is j. By
                substituting the asymptotic expansion (6.2.8) in the recursion equation, we find,
                after cancelling out common terms,


                            g(R) + υ i (R) ≈ c i (R i ) +  p ij (R i )υ j (R),  i ∈ I.  (6.2.9)
                                                 j∈I
                The intuitive idea behind the procedure for improving the given policy R is to
                consider the following difference in costs:
                  (i, a, R) = the difference in total expected costs over an infinitely long period
                           of time by taking first action a and next using policy R rather
                           than using policy R from scratch when the initial state is i.
                This difference is equal to zero when action a = R i is chosen. We wish to make
                the difference  (i, a, R) as negative as possible. This difference is given by

                                      

                        (i, a, R) = lim   c i (a) +  p ij (a)V n−1 (j, R)
                                  n→∞
                                              j∈I
                                                                         

                                            −{c i (R i ) +  p ij (R i )V n−1 (j, R)} .
                                                                         
                                                      j∈I
                Substituting (6.2.8) into the expression between brackets, we find that for large n
                this expression is approximately equal to


                           c i (a) +  p ij (a)v j (R) − (n − 1)g(R)
                                  j∈I
                                                                    
                                                                    

                                −  c i (R i ) +  p ij (R i )v j (R) − (n − 1)g(R) .
                                                                    
                                           j∈I
                This gives

                       (i, a, R) ≈ c i (a) +  p ij (a)v j (R) − c i (R i ) −  p ij (R i )v j (R).
                                       j∈I                    j∈I
                Thus, by using (6.2.9),


                            (i, a, R) ≈ c i (a) +  p ij (a)v j (R) − g(R) − v i (R).
                                             j∈I
   243   244   245   246   247   248   249   250   251   252   253