Page 266 - A First Course In Stochastic Models
In spite of the possible pitfalls of the Lagrangian approach, this approach may be quite
useful in practical applications having a specific structure.
6.6 VALUE-ITERATION ALGORITHM
The policy-iteration algorithm and the linear programming formulation both require
solving, in each iteration, a system of linear equations of the same size as the state
space. For a large state space this is computationally burdensome and makes both
methods unattractive for large-scale
Markov decision problems. In this section we discuss an alternative algorithm
which avoids solving systems of linear equations but uses instead the recursive
solution approach from dynamic programming. This method is the value-iteration
algorithm which computes recursively a sequence of value functions approximating
the minimal average cost per time unit. The value functions provide lower and upper
bounds on the minimal average cost and under a certain aperiodicity condition
these bounds converge to the minimal average cost. The aperiodicity condition is
not restrictive, since it can be forced to hold by a simple data transformation. The
value-iteration algorithm endowed with these lower and upper bounds is in general
the best computational method for solving large-scale Markov decision problems.
This holds despite the fact that the value-iteration algorithm lacks
the robustness of the policy-iteration algorithm: the number of iterations is
problem dependent and typically grows with the number of states of the problem
under consideration. Another important advantage of value iteration is that it is
usually easy to write code for specific applications. By exploiting the structure of
the particular application one usually avoids computer memory problems that may
be encountered when using policy iteration. Value iteration is not only a powerful
method for controlled Markov chains, but it is also a useful tool to compute bounds
on performance measures in a single Markov chain; see Example 6.6.1.
In this section the value-iteration algorithm will be analysed under the weak
unichain assumption from Section 6.5. Under this assumption the minimal average
cost per time unit is independent of the initial state. Let
g* = the minimal long-run average cost per time unit.
The value-iteration algorithm computes recursively for n = 1, 2, . . . the value
function V_n(i) from

    V_n(i) = min_{a∈A(i)} { c_i(a) + Σ_{j∈I} p_ij(a) V_{n−1}(j) },   i ∈ I,   (6.6.1)
starting with an arbitrarily chosen function V_0(i), i ∈ I. The quantity V_n(i) can
be interpreted as the minimal total expected costs with n periods left to the time
horizon when the current state is i and a terminal cost of V_0(j) is incurred when
the system ends up at state j; see Denardo (1982) and Derman (1970) for a proof.
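To make the recursion (6.6.1) concrete, the following minimal Python sketch (not from the book) iterates (6.6.1) on a small illustrative Markov decision problem and uses the standard span bounds m_n = min_i (V_n(i) − V_{n−1}(i)) and M_n = max_i (V_n(i) − V_{n−1}(i)) on g* as a stopping test; the data, tolerance, and function name are assumptions chosen for illustration.

```python
import numpy as np

def value_iteration(costs, probs, eps=1e-8, max_iter=10_000):
    """Value iteration (6.6.1) for an average-cost Markov decision problem.

    costs[i][a] : one-step cost c_i(a)
    probs[i][a] : transition probabilities p_ij(a) as a vector over j
    Returns (g, V, policy), where g approximates the minimal average cost g*.
    """
    n_states = len(costs)
    V = np.zeros(n_states)                     # arbitrary starting function V_0
    for _ in range(max_iter):
        V_new = np.empty(n_states)
        policy = np.empty(n_states, dtype=int)
        for i in range(n_states):
            # minimize c_i(a) + sum_j p_ij(a) * V_{n-1}(j) over a in A(i)
            q = [costs[i][a] + probs[i][a] @ V for a in range(len(costs[i]))]
            policy[i] = int(np.argmin(q))
            V_new[i] = q[policy[i]]
        diff = V_new - V
        m, M = diff.min(), diff.max()          # lower/upper bounds on g*
        V = V_new
        if M - m <= eps:                       # bounds have (nearly) coincided
            break
    return 0.5 * (m + M), V, policy

# Illustrative two-state problem: state 0 has two actions, state 1 has one.
costs = [[1.0, 2.5], [3.0]]
probs = [[np.array([0.5, 0.5]), np.array([1.0, 0.0])],
         [np.array([0.5, 0.5])]]
g, V, pol = value_iteration(costs, probs)
# For these data the minimal average cost is 2.0, attained by action 0 in state 0.
```

Under the weak unichain assumption and the aperiodicity condition mentioned above, m_n and M_n converge to g*, so the midpoint (m + M)/2 is a natural estimate of the minimal average cost.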