Page 257 - A First Course In Stochastic Models

250             DISCRETE-TIME MARKOV DECISION PROCESSES

                Step 2 (policy improvement). The test quantity T_i(a, R) has the values

                     T_2(0, R^(1)) = 5.6410,  T_2(1, R^(1)) = 7.0000,  T_3(0, R^(1)) = 7.4359,
                     T_3(1, R^(1)) = 7.0000,  T_4(0, R^(1)) = 9.4872,  T_4(1, R^(1)) = 5.0000.

                This yields the new policy R^(2) = (0, 0, 1, 1, 2, 2) by choosing for each state i
                the action a that minimizes T_i(a, R^(1)).
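The improvement step can be sketched generically. Assuming the usual form of the test quantity in this algorithm, T_i(a, R) = c_i(a) − g(R) + Σ_j p_ij(a) v_j(R), the function below picks a minimizing action for each state. The two-state cost and transition data in the demo are invented for illustration and are not this example's data:

```python
def improve_policy(c, P, g, v):
    """Policy-improvement step: for each state i, pick an action a that
    minimizes the test quantity T_i(a) = c[i][a] - g + sum_j P[i][a][j]*v[j]."""
    def T(i, a):
        return c[i][a] - g + sum(p * vj for p, vj in zip(P[i][a], v))
    return [min(c[i], key=lambda a, i=i: T(i, a)) for i in range(len(v))]

# Toy two-state illustration (made-up numbers, purely for demonstration):
c = [{0: 0.0, 1: 5.0}, {0: 3.0, 1: 1.0}]      # c[i][a]: one-step cost
P = [{0: [0.5, 0.5], 1: [1.0, 0.0]},          # P[i][a]: transition probabilities
     {0: [0.0, 1.0], 1: [1.0, 0.0]}]
print(improve_policy(c, P, g=1.0, v=[0.0, 4.0]))  # [0, 1]
```

Ties are broken by whichever action is encountered first, which is all that is needed for the convergence test: the iteration stops once the improvement step returns the current policy unchanged.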
                Step 3 (convergence test). The new policy R^(2) is different from the previous policy
                R^(1) and hence another iteration is performed.
                Iteration 2
                Step 1 (value determination). The average cost and the relative values of policy
                R^(2) = (0, 0, 1, 1, 2, 2) are computed by solving the linear equations

                              v_1 = 0 − g + 0.9 v_1 + 0.1 v_2
                              v_2 = 0 − g + 0.8 v_2 + 0.1 v_3 + 0.05 v_4 + 0.05 v_5
                              v_3 = 7 − g + v_1
                              v_4 = 5 − g + v_1
                              v_5 = 10 − g + v_6
                              v_6 = 0 − g + v_1
                              v_6 = 0.

                The solution of these linear equations is given by

                  g(R^(2)) = 0.4462,  v_1(R^(2)) = 0.4462,  v_2(R^(2)) = 4.9077,  v_3(R^(2)) = 7.0000,
                  v_4(R^(2)) = 5.0000,  v_5(R^(2)) = 9.5538,  v_6(R^(2)) = 0.
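The value-determination step is simply a linear solve: seven equations (the six value equations plus the normalization v_6 = 0) in the seven unknowns g, v_1, ..., v_6. A minimal pure-Python sketch (Gaussian elimination, no external libraries) that reproduces the values above:

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for k in range(n):
        # bring the largest entry in column k onto the diagonal
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):  # back substitution
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x

# Unknowns ordered as (g, v1, ..., v6); each equation of the example is
# rewritten with all unknowns moved to the left-hand side.
A = [
    [1,  0.1, -0.1,  0,     0,     0,     0],   # v1 = 0 - g + 0.9 v1 + 0.1 v2
    [1,  0,    0.2, -0.1,  -0.05, -0.05,  0],   # v2 = 0 - g + 0.8 v2 + ...
    [1, -1,    0,    1,     0,     0,     0],   # v3 = 7 - g + v1
    [1, -1,    0,    0,     1,     0,     0],   # v4 = 5 - g + v1
    [1,  0,    0,    0,     0,     1,    -1],   # v5 = 10 - g + v6
    [1, -1,    0,    0,     0,     0,     1],   # v6 = 0 - g + v1
    [0,  0,    0,    0,     0,     0,     1],   # v6 = 0 (normalization)
]
b = [0, 0, 7, 5, 10, 0, 0]

g, v1, v2, v3, v4, v5, v6 = solve_linear(A, b)
print(round(g, 4), round(v2, 4), round(v5, 4))  # 0.4462 4.9077 9.5538
```

Note that the exact average cost is g = 1.45/3.25 = 0.4462 (to four decimals): with v_6 = 0 the equations for v_6, v_4, v_3 and v_5 collapse to v_1 = g, v_4 = 5, v_3 = 7 and v_5 = 10 − g, after which the first two equations determine g.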
                Step 2 (policy improvement). The test quantity T_i(a, R^(2)) has the values

                     T_2(0, R^(2)) = 4.9077,  T_2(1, R^(2)) = 7.0000,  T_3(0, R^(2)) = 6.8646,
                     T_3(1, R^(2)) = 7.0000,  T_4(0, R^(2)) = 6.8307,  T_4(1, R^(2)) = 5.0000.

                This yields the new policy R^(3) = (0, 0, 0, 1, 2, 2).
                Step 3 (convergence test). The new policy R^(3) is different from the previous policy
                R^(2) and hence another iteration is performed.
                Iteration 3
                Step 1 (value determination). The average cost and the relative values of policy
                R^(3) = (0, 0, 0, 1, 2, 2) are computed by solving the linear equations