
In spite of the possible pitfalls of the Lagrangian approach, this approach may be quite
useful in practical applications having a specific structure.



                             6.6  VALUE-ITERATION ALGORITHM
The policy-iteration algorithm and the linear programming formulation both require
that in each iteration a system of linear equations of the same size as the state
space is solved. In general this is burdensome for a large state space and makes
these algorithms computationally unattractive for large-scale
Markov decision problems. In this section we discuss an alternative algorithm
                which avoids solving systems of linear equations but uses instead the recursive
                solution approach from dynamic programming. This method is the value-iteration
                algorithm which computes recursively a sequence of value functions approximating
the minimal average cost per time unit. The value functions provide lower and upper
bounds on the minimal average cost, and under a certain aperiodicity condition
these bounds converge to it. The aperiodicity condition is
                not restrictive, since it can be forced to hold by a simple data transformation. The
                value-iteration algorithm endowed with these lower and upper bounds is in general
                the best computational method for solving large-scale Markov decision problems.
This is true even though the value-iteration algorithm does not
have the robustness of the policy-iteration algorithm: the number of iterations is
problem-dependent and typically increases with the number of states of the problem
under consideration. Another important advantage of value iteration is that it is
usually easy to write code for specific applications. By exploiting the structure of
the particular application one can usually avoid the computer memory problems that may
be encountered when using policy iteration. Value iteration is not only a powerful
                method for controlled Markov chains, but it is also a useful tool to compute bounds
                on performance measures in a single Markov chain; see Example 6.6.1.
                  In this section the value-iteration algorithm will be analysed under the weak
                unichain assumption from Section 6.5. Under this assumption the minimal average
                cost per time unit is independent of the initial state. Let
$g^*$ = the minimal long-run average cost per time unit.
The value-iteration algorithm computes recursively for $n = 1, 2, \ldots$ the value
function $V_n(i)$ from
$$V_n(i) = \min_{a \in A(i)} \left\{ c_i(a) + \sum_{j \in I} p_{ij}(a) V_{n-1}(j) \right\}, \qquad i \in I, \qquad (6.6.1)$$
starting with an arbitrarily chosen function $V_0(i)$, $i \in I$. The quantity $V_n(i)$ can
be interpreted as the minimal total expected costs with $n$ periods left to the time
horizon when the current state is $i$ and a terminal cost of $V_0(j)$ is incurred when
the system ends up at state $j$; see Denardo (1982) and Derman (1970) for a proof.
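To make the recursion concrete, the following Python sketch applies (6.6.1) to a small hypothetical two-state problem; the cost and transition data, the tolerance and the iteration cap are made-up illustrative choices, not taken from the text. The sketch also records the minimal and maximal differences $V_n(i) - V_{n-1}(i)$, which furnish the lower and upper bounds on $g^*$ referred to above.

```python
import numpy as np

# Hypothetical two-state problem; the data are chosen only for illustration.
# actions[i] is the list of pairs (c_i(a), transition row p_i.(a)) for a in A(i).
actions = [
    [(1.0, np.array([0.5, 0.5])), (2.0, np.array([0.9, 0.1]))],  # A(0)
    [(3.0, np.array([0.3, 0.7]))],                               # A(1)
]

V = np.zeros(len(actions))  # arbitrary starting function V_0(i)

for n in range(1, 10_000):
    # Recursion (6.6.1): minimize c_i(a) + sum_j p_ij(a) V_{n-1}(j) over a in A(i).
    V_new = np.array([min(c + p @ V for c, p in acts) for acts in actions])

    # Bounds on the minimal average cost:
    #   m_n = min_i {V_n(i) - V_{n-1}(i)}  <=  g*  <=  max_i {V_n(i) - V_{n-1}(i)} = M_n.
    m_n = float(np.min(V_new - V))
    M_n = float(np.max(V_new - V))
    V = V_new
    if M_n - m_n <= 1e-8:  # stop once the bounds (nearly) coincide
        break

print(f"after {n} iterations: {m_n:.8f} <= g* <= {M_n:.8f}")
```

Representing each action by its one-step cost and transition row keeps the minimization in (6.6.1) a one-line reduction per state; for a problem with a sparse transition structure one would store only the nonzero entries, which is exactly the kind of structure exploitation mentioned above.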