

Markov decision model are defined by
$$\overline{I} = I, \qquad \overline{A}(i) = A(i), \quad i \in \overline{I},$$
$$\overline{c}_i(a) = c_i(a), \qquad a \in \overline{A}(i) \text{ and } i \in \overline{I},$$
$$\overline{p}_{ij}(a) =
\begin{cases}
\tau p_{ij}(a), & j \neq i,\ a \in \overline{A}(i) \text{ and } i \in \overline{I},\\
\tau p_{ij}(a) + 1 - \tau, & j = i,\ a \in \overline{A}(i) \text{ and } i \in \overline{I}.
\end{cases}$$
For each stationary policy, the associated Markov chain $\{\overline{X}_n\}$ in the perturbed model
is aperiodic. It is not difficult to verify that for each stationary policy the average
cost per time unit in the perturbed model is the same as that in the original model.
For the unichain case this is an immediate consequence of the representation (6.2.7)
for the average cost and the fact that for each stationary policy the Markov chain
$\{\overline{X}_n\}$ has the same equilibrium probabilities as the Markov chain $\{X_n\}$ in the original
model. For the multichain case, a similar argument can be used to show that the
two models are in fact equivalent. Thus the value-iteration algorithm can be applied
                to the perturbed model in order to solve the original model. In specific problems
                involving periodicities, the ‘optimal’ value of τ is usually not clear beforehand;
empirical investigations indicate that $\tau = \frac{1}{2}$ is usually a satisfactory choice.
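To make the transformation concrete, the following sketch (an illustration only, not from the text; the function names, the two-state example and the use of NumPy are assumptions) applies $\overline{p}_{ij} = \tau p_{ij} + (1-\tau)$ for $j = i$ to the transition matrix of a fixed stationary policy and checks numerically that the equilibrium probabilities, and hence the average cost per time unit, are unchanged.

```python
import numpy as np

def perturb(P, tau=0.5):
    """Aperiodicity transformation: keep probability (1 - tau) of staying
    put; spread the remaining mass tau according to the original law."""
    return tau * P + (1.0 - tau) * np.eye(P.shape[0])

def equilibrium(P):
    """Equilibrium distribution pi solving pi P = pi with sum(pi) = 1."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

# A periodic two-state chain (period 2) under some stationary policy,
# together with the one-step costs incurred under that policy.
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])
c = np.array([2.0, 5.0])

pi_orig = equilibrium(P)
pi_pert = equilibrium(perturb(P, tau=0.5))

# Same equilibrium probabilities, hence the same average cost sum_i pi_i c_i.
print(pi_orig, pi_pert)          # both [0.5, 0.5]
print(pi_orig @ c, pi_pert @ c)  # both 3.5
```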
                Modified value iteration with a dynamic relaxation factor

                Value iteration does not have the fast convergence of policy iteration. The number
                of iterations required by the value-iteration algorithm is problem dependent and
                increases when the number of problem states gets larger. Also, the tolerance number
                ε in the stopping criterion affects the number of iterations required. The stopping
criterion should be based on the lower and upper bounds $m_n$ and $M_n$ but not on
any repetitive behaviour of the generated policies $R(n)$.
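As a small illustration of such a bounds-based test, the sketch below computes $m_n = \min_i\{V_n(i) - V_{n-1}(i)\}$ and $M_n = \max_i\{V_n(i) - V_{n-1}(i)\}$ from one value-iteration sweep. The stopping step of the algorithm is not reproduced on this page, so the relative inequality used here, like the function name, is an assumption.

```python
import numpy as np

def bounds_and_stop(V_new, V_old, eps=1e-6):
    """Lower and upper bounds m_n, M_n from one value-iteration sweep;
    the convergence test uses these bounds, not the generated policies."""
    diff = np.asarray(V_new) - np.asarray(V_old)
    m_n, M_n = diff.min(), diff.max()
    # Relative stopping test on the bounds (assumed form).
    stop = 0 <= M_n - m_n <= eps * m_n
    return m_n, M_n, stop
```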
                  The convergence rate of value iteration can often be accelerated by using a
                relaxation factor, such as in successive overrelaxation for solving a single system
of linear equations. Then at the $n$th iteration a new approximation to the value
function $V_n(i)$ is obtained by using both the previous values $V_{n-1}(i)$ and the
residuals $V_n(i) - V_{n-1}(i)$. It is possible to select dynamically a relaxation factor and
                thus avoid the experimental determination of the best value of a fixed relaxation
                factor. The following modification of the standard value-iteration algorithm can
                be formulated. Steps 0, 1, 2 and 3 are as before, while step 4 of the standard
                value-iteration algorithm is modified as follows.
Step 4(a). Determine the states $u$ and $v$ such that
$$V_n(u) - V_{n-1}(u) = m_n \quad\text{and}\quad V_n(v) - V_{n-1}(v) = M_n$$
and compute the relaxation factor
$$\omega = \frac{M_n - m_n}{M_n - m_n + \sum_{j \in I} \{p_{uj}(R_u) - p_{vj}(R_v)\}\{V_n(j) - V_{n-1}(j)\}},$$
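The remainder of step 4, in which $\omega$ is used to form the new approximation, continues beyond this excerpt. The sketch below is an illustration only: the array names and the representation of the policy $R(n)$ are assumptions, and the final relaxed update $V_{n-1}(i) + \omega\{V_n(i) - V_{n-1}(i)\}$ is taken from the description in the preceding paragraph rather than from the algorithm's step 4(b).

```python
import numpy as np

def dynamic_relaxation_step(V_new, V_old, P_policy):
    """Step 4(a) sketch: find the states u and v attaining the bounds
    m_n and M_n, compute the dynamic relaxation factor omega, and apply
    the relaxed update V_{n-1}(i) + omega * (V_n(i) - V_{n-1}(i)).

    V_new    : V_n produced by the standard value-iteration step (1-d array)
    V_old    : V_{n-1} (1-d array)
    P_policy : assumed layout -- row i holds the transition probabilities
               p_{ij}(R_i) for the action R_i chosen in state i by R(n)
    """
    diff = V_new - V_old
    u = int(np.argmin(diff))   # V_n(u) - V_{n-1}(u) = m_n
    v = int(np.argmax(diff))   # V_n(v) - V_{n-1}(v) = M_n
    m_n, M_n = diff[u], diff[v]

    denom = (M_n - m_n) + (P_policy[u] - P_policy[v]) @ diff
    omega = (M_n - m_n) / denom if denom != 0 else 1.0

    # Relaxed update of the value function (assumed form of step 4(b)).
    return V_old + omega * diff, omega
```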