and so, by Theorem 7.1.1, $\overline{g}(R) = g(R)$. Thus we can conclude that an average cost optimal policy in the semi-Markov model can be obtained by solving an appropriate discrete-time Markov decision model. This conclusion is particularly useful with respect to the value-iteration algorithm. In applying value iteration to the transformed model, it is no restriction to assume that for each stationary policy the associated Markov chain $\{\overline{X}_n\}$ is aperiodic. By choosing the constant $\tau$ strictly less than $\min_{i,a} \tau_i(a)$, we always have $\overline{p}_{ii}(a) > 0$ for all $i, a$ and thus the required aperiodicity.
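To make the aperiodicity argument concrete, here is a minimal NumPy sketch (our own illustration; the array layout and the function name are assumptions, not the book's) of the data transformation from Section 7.1, which sets $\overline{c}_i(a) = c_i(a)/\tau_i(a)$ and $\overline{p}_{ij}(a) = (\tau/\tau_i(a))\,p_{ij}(a) + (1 - \tau/\tau_i(a))\,\delta_{ij}$:

```python
import numpy as np

def transform(c, p, tau_ia, tau):
    """Sketch of the data transformation of Section 7.1.

    c[i, a]      : expected one-step cost c_i(a)
    p[i, a, j]   : transition probability p_ij(a)
    tau_ia[i, a] : expected sojourn time tau_i(a)
    tau          : constant with 0 < tau < min_{i,a} tau_i(a)
    """
    assert 0 < tau < tau_ia.min()
    c_bar = c / tau_ia                     # transformed cost c_i(a) / tau_i(a)
    ratio = tau / tau_ia                   # tau / tau_i(a), strictly below 1
    p_bar = ratio[:, :, None] * p          # scale all transition probabilities
    i_idx = np.arange(p.shape[0])
    p_bar[i_idx, :, i_idx] += 1.0 - ratio  # diagonal gets the leftover mass
    return c_bar, p_bar
```

Because $\tau < \tau_i(a)$ for all $i$ and $a$, the leftover mass $1 - \tau/\tau_i(a)$ added to the diagonal is strictly positive, which is exactly the aperiodicity guarantee used above.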
7.2 ALGORITHMS FOR AN OPTIMAL POLICY
In this section we outline how the algorithms for the discrete-time Markov decision
model can be extended to the semi-Markov decision model.
Policy-iteration algorithm
The policy-iteration algorithm will be described under the unichain assumption. This assumption requires that for each stationary policy the embedded Markov chain $\{X_n\}$ has no two disjoint closed sets. By the data transformation, it is directly verified that the value-determination equations (6.3.2) for a given stationary policy $R$ remain valid provided that we replace $g$ by $g\tau_i(R_i)$. The policy-improvement procedure from Theorem 6.2.1 also remains valid when we replace $g$ by $g\tau_i(R_i)$.
Suppose that $g(R)$ and $\upsilon_i(R)$, $i \in I$, are the average cost and the relative values of a stationary policy $R$. If a stationary policy $\overline{R}$ is constructed such that, for each state $i \in I$,

$$c_i(\overline{R}_i) - g(R)\tau_i(\overline{R}_i) + \sum_{j \in I} p_{ij}(\overline{R}_i)\,\upsilon_j(R) \leq \upsilon_i(R), \qquad (7.2.1)$$

then $g(\overline{R}) \leq g(R)$. Moreover, $g(\overline{R}) < g(R)$ if the strict inequality sign holds in (7.2.1) for some state $i$ which is recurrent under $\overline{R}$. These statements can be verified by the same arguments as used in the second part of the proof of Theorem 6.2.1.
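In code, this improvement test is a per-state comparison; a minimal sketch follows (our own, reusing the hypothetical array layout from the sketch above). Since the test quantity for $a = R_i$ equals $\upsilon_i(R)$ by the value-determination equations, switching only on a strictly smaller test value guarantees (7.2.1) and hence $g(\overline{R}) \leq g(R)$.

```python
def improve_policy(R, g, v, c, p, tau_ia, actions):
    """Policy-improvement step based on (7.2.1) (sketch).

    R[i]       : action of the current policy in state i
    g, v       : average cost g(R) and relative values v_i(R)
    actions[i] : the action set in state i
    """
    def test_value(i, a):
        # c_i(a) - g(R) * tau_i(a) + sum_j p_ij(a) * v_j(R)
        return c[i, a] - g * tau_ia[i, a] + p[i, a] @ v

    R_bar = R.copy()
    for i, acts in enumerate(actions):
        current = test_value(i, R[i])            # equals v[i] for the old action
        for a in acts:
            if test_value(i, a) < current - 1e-10:   # strict improvement only
                R_bar[i], current = a, test_value(i, a)
    return R_bar
```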
Under the unichain assumption, we can now formulate the following policy-
iteration algorithm:
Step 0 (initialization). Choose a stationary policy R.
Step 1 (value-determination step). For the current rule $R$, compute the average cost $g(R)$ and the relative values $\upsilon_i(R)$, $i \in I$, as the unique solution to the linear equations

$$\upsilon_i = c_i(R_i) - g\tau_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\,\upsilon_j, \qquad i \in I,$$
$$\upsilon_s = 0,$$

where $s$ is an arbitrarily chosen state.
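These are $|I| + 1$ linear equations in the $|I| + 1$ unknowns $g$ and $\upsilon_i$, $i \in I$, so Step 1 reduces to a single linear solve. A sketch (again with our assumed array layout, not the book's notation):

```python
import numpy as np

def value_determination(R, c, p, tau_ia, s=0):
    """Solve the value-determination equations for policy R (sketch).

    Unknown vector x = (g, v_0, ..., v_{N-1}), normalized by v_s = 0.
    R is an integer array with R[i] the action chosen in state i.
    """
    N = len(R)
    idx = np.arange(N)
    A = np.zeros((N + 1, N + 1))
    b = np.zeros(N + 1)
    # Rows 0..N-1: v_i + g * tau_i(R_i) - sum_j p_ij(R_i) v_j = c_i(R_i)
    A[:N, 0] = tau_ia[idx, R]             # coefficient of g
    A[:N, 1:] = np.eye(N) - p[idx, R, :]  # coefficients of v_j
    b[:N] = c[idx, R]
    A[N, 1 + s] = 1.0                     # normalization v_s = 0
    x = np.linalg.solve(A, b)
    return x[0], x[1:]                    # g(R) and (v_i(R))
```

Alternating this step with the improvement sketch above, and stopping when the policy no longer changes, yields the policy-iteration loop for the semi-Markov model.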