Page 291 - A First Course In Stochastic Models
P. 291
ALGORITHMS FOR AN OPTIMAL POLICY 285
Step 2 (policy-improvement step). For each state i ∈ I, determine an action a i
yielding the minimum in
min c i (a) − g(R)τ i (a) + p ij (a)υ j (R) .
a∈A(i)
j∈I
The new stationary policy R is obtained by choosing R i = a i for all i ∈ I with the
convention that R i is chosen equal to the old action R i when this action minimizes
the policy-improvement quantity.
Step 3 (convergence test). If the new policy R = R, then the algorithm is stopped
with policy R. Otherwise, go to step 1 with R replaced by R.
In the same way as for the discrete-time Markov decision model, it can be
shown that the algorithm converges in a finite number of iterations to an average
cost optimal policy. Also, as a consequence of the convergence of the algorithm,
∗
∗
there exist numbers g and υ satisfying the average cost optimality equation
i
υ = min c i (a) − g τ i (a) + p ij (a)υ j ∗ , i ∈ I. (7.2.2)
∗
∗
i
a∈A(i)
j∈I
The constant g is uniquely determined as the minimal average cost per time unit.
∗
Moreover, each stationary policy whose actions minimize the right-hand side of
(7.2.2) for all i ∈ I is average cost optimal. The proof of these statements is left
as an exercise for the reader.
Value-iteration algorithm
For the semi-Markov decision model the formulation of a value-iteration algorithm
is not straightforward. A recursion relation for the minimal expected costs over the
first n decision epochs does not take into account the non-identical transition times
and thus these costs cannot be related to the minimal average cost per time unit.
However, by the data transformation method from Section 7.1, we can convert the
semi-Markov decision model into a discrete-time Markov decision model such that
both models have the same average cost for each stationary policy. A value-iteration
algorithm for the original semi-Markov decision model is then implied by the value-
iteration algorithm for the transformed discrete-time Markov decision model. In the
discrete-time model it is no restriction to assume that all c i (a) = c i (a)/τ i (a) are
positive; otherwise, add a sufficiently large positive constant to each c i (a). The
following recursion method results for the semi-Markov decision model:
Step 0. Choose V 0 (i) such that 0 ≤ V 0 (i) ≤ min a {c i (a)/τ i (a)} for all i. Choose a
number τ with 0 < τ ≤ min i,a τ i (a). Let n := 1.