7.1 THE SEMI-MARKOV DECISION MODEL
Consider a dynamic system whose state is reviewed at random epochs. At those
epochs a decision has to be made and costs are incurred as a consequence of the
decision made. The set of possible states is denoted by I. For each state i ∈ I, a
set A(i) of possible actions is available. It is assumed that the state space I and the
action sets A(i), i ∈ I, are finite. This controlled dynamic system is called a
semi-Markov decision process when the following Markovian properties are satisfied: if
at a decision epoch the action a is chosen in state i, then the time until the next
decision epoch and the state at that epoch depend only on the present state i and
the subsequently chosen action a and are thus independent of the past history of the
system. Also, the costs incurred until the next decision epoch depend only on the
present state and the action chosen in that state. We note that in specific problems
the state occurring at the next transition will often depend on the time until that
transition. Also, the costs usually consist of lump costs incurred at the decision
epochs and rate costs incurred continuously in time. As an example, consider a
single-product inventory system in which the demand process is described by a
Poisson process and the inventory position can be replenished at any time. In this
example the decision epochs are the demand epochs and they occur randomly in
time. The decision is whether or not to raise the inventory position after a demand
has occurred. The costs typically consist of fixed replenishment costs and holding
costs that are incurred continuously in time.
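To see these ingredients concretely, here is a minimal simulation sketch of the inventory example. All numbers and the replenish-when-empty rule are illustrative assumptions, not part of the model: unit demands arrive according to a Poisson process, decision epochs coincide with the demand epochs, holding costs accrue continuously between epochs, and a fixed cost is incurred at each replenishment.

```python
import random

# Illustrative parameters (assumptions for this sketch, not from the text)
lam = 2.0     # Poisson demand rate (units per time unit)
S = 5         # hypothetical order-up-to level
K = 10.0      # fixed replenishment cost (lump cost)
h = 1.0       # holding cost per unit per time unit (rate cost)

def simulate(horizon=100_000.0, seed=7):
    rng = random.Random(seed)
    t, stock, total = 0.0, S, 0.0
    while t < horizon:
        dt = rng.expovariate(lam)   # time to the next demand epoch
        total += h * stock * dt     # rate costs accrue continuously
        t += dt
        stock -= 1                  # one unit is demanded
        if stock == 0:              # decision: replenish or not
            total += K              # lump cost at the decision epoch
            stock = S
    return total / t                # observed average cost per time unit

print(f"estimated long-run average cost: {simulate():.2f}")
```

Note how the time between decision epochs is random and the costs split into a lump part and a rate part, exactly the structure described above.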
The long-run average cost per time unit
The long-run average cost per time unit is taken as the optimality criterion. For this
criterion the semi-Markov decision model is in fact determined by the following
characteristics:
p_ij(a) = the probability that at the next decision epoch the system will be in
state j if action a is chosen in the present state i,
τ_i(a) = the expected time until the next decision epoch if action a is chosen
in the present state i,
c_i(a) = the expected costs incurred until the next decision epoch if action a
is chosen in the present state i.
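For the inventory example these characteristics can be written down in closed form. A minimal sketch, under the illustrative assumptions of the simulation above: demands arrive at rate λ, so τ_i(a) = 1/λ for every state and action, and holding cost at level ℓ over the exponential interval has mean hℓ/λ.

```python
# Model primitives p_ij(a), tau_i(a), c_i(a) for the inventory sketch.
# States: inventory position just after a demand (0, ..., S-1).
# Actions: 0 = do nothing, 1 = replenish up to S (forced in state 0).
lam, S, K, h = 2.0, 5, 10.0, 1.0
states = list(range(S))
A = {i: ([1] if i == 0 else [0, 1]) for i in states}

def level(i, a):
    # inventory position carried until the next demand epoch
    return S if a == 1 else i

def tau(i, a):
    # expected time to the next (Poisson) demand epoch
    return 1.0 / lam

def cost(i, a):
    # lump replenishment cost plus expected holding cost until the next
    # epoch: rate h*level over an Exp(lam) interval has mean h*level/lam
    return (K if a == 1 else 0.0) + h * level(i, a) / lam

def p(i, a, j):
    # exactly one unit is demanded at the next epoch, so the next state
    # is the post-action level minus one (deterministic in this sketch)
    return 1.0 if j == level(i, a) - 1 else 0.0
```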
It is assumed that τ_i(a) > 0 for all i ∈ I and a ∈ A(i). As before, a stationary
policy R is a rule that assigns to each state i a single action R_i ∈ A(i) and
always prescribes taking this action whenever the system is observed in state i at
a decision epoch. Since the state space is finite, it can be shown that under each
stationary policy the number of decisions made in any finite time interval is finite
with probability 1. We omit the proof of this result. Let
X_n = the state of the system at the nth decision epoch.
Then it follows that under a stationary policy R the embedded stochastic process
{X_n} is a discrete-time Markov chain with one-step transition probabilities p_ij(R_i).
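This embedded chain is what makes a stationary policy easy to evaluate: by the standard renewal-reward argument, the long-run average cost per time unit under R equals Σ_j π_j c_j(R_j) / Σ_j π_j τ_j(R_j), where π is the stationary distribution of p_ij(R_i). A minimal sketch, reusing the primitives and numbers from the inventory example above:

```python
import numpy as np

def average_cost(R, states, p, tau, cost):
    """Long-run average cost of a stationary policy R (state -> action),
    computed from the stationary distribution of the embedded chain."""
    n = len(states)
    P = np.array([[p(i, R[i], j) for j in states] for i in states])
    # solve pi @ P = pi together with sum(pi) = 1 (least squares)
    M = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi = np.linalg.lstsq(M, b, rcond=None)[0]
    c = np.array([cost(i, R[i]) for i in states])
    t = np.array([tau(i, R[i]) for i in states])
    return (pi @ c) / (pi @ t)

# the replenish-when-empty rule used in the simulation sketch
R = {i: (1 if i == 0 else 0) for i in states}
# analytic check for these numbers: lam*K/S + h*(S+1)/2 = 7.0
print(average_cost(R, states, p, tau, cost))
```

For the illustrative numbers this returns 7.0, in agreement with the simulation estimate above.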