Page 286 - A First Course In Stochastic Models

280                 SEMI-MARKOV DECISION PROCESSES

                         7.1   THE SEMI-MARKOV DECISION MODEL
                Consider a dynamic system whose state is reviewed at random epochs. At those
                epochs a decision has to be made and costs are incurred as a consequence of the
                decision made. The set of possible states is denoted by I. For each state i ∈ I, a
                set A(i) of possible actions is available. It is assumed that the state space I and the
                action sets A(i), i ∈ I are finite. This controlled dynamic system is called a semi-
                Markov decision process when the following Markovian properties are satisfied: if
                at a decision epoch the action a is chosen in state i, then the time until the next
                decision epoch and the state at that epoch depend only on the present state i and
                the subsequently chosen action a and are thus independent of the past history of the
                system. Also, the costs incurred until the next decision epoch depend only on the
                present state and the action chosen in that state. We note that in specific problems
                the state occurring at the next transition will often depend on the time until that
                transition. Also, the costs usually consist of lump costs incurred at the decision
                epochs and rate costs incurred continuously in time. As an example, consider a
                single-product inventory system in which the demand process is described by a
                Poisson process and the inventory position can be replenished at any time. In this
                example the decision epochs are the demand epochs and they occur randomly in
                time. The decision is whether or not to raise the inventory position after a demand
                has occurred. The costs typically consist of fixed replenishment costs and holding
                costs that are incurred continuously in time.

                The long-run average cost per time unit
                The long-run average cost per time unit is taken as the optimality criterion. For this
                criterion the semi-Markov decision model is in fact determined by the following
                characteristics:
                  p_ij(a) = the probability that at the next decision epoch the system will be in
                            state j if action a is chosen in the present state i,
                  τ_i(a)  = the expected time until the next decision epoch if action a is chosen
                            in the present state i,
                  c_i(a)  = the expected costs incurred until the next decision epoch if action a
                            is chosen in the present state i.
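                As an illustration, the three characteristics can be written down concretely.
                The following is a minimal Python sketch for a toy two-state model; the state
                names, action labels, and all numerical values are hypothetical, chosen only to
                satisfy the model's requirements (each row of p_ij(a) sums to 1, each τ_i(a) > 0):

```python
# Data defining a (hypothetical) semi-Markov decision model with two states.
# The three characteristics p_ij(a), tau_i(a), c_i(a) are stored as plain
# dictionaries keyed by the state-action pair (i, a).

states = [0, 1]                        # state space I
actions = {0: ["keep", "replenish"],   # action sets A(i)
           1: ["keep"]}

# p[(i, a)][j] = probability the system is in state j at the next decision epoch
p = {(0, "keep"):      {0: 0.7, 1: 0.3},
     (0, "replenish"): {0: 0.2, 1: 0.8},
     (1, "keep"):      {0: 0.5, 1: 0.5}}

# tau[(i, a)] = expected time until the next decision epoch (must be > 0)
tau = {(0, "keep"): 1.0, (0, "replenish"): 1.5, (1, "keep"): 2.0}

# c[(i, a)] = expected costs incurred until the next decision epoch
c = {(0, "keep"): 4.0, (0, "replenish"): 7.0, (1, "keep"): 1.0}

# sanity checks: each transition row sums to 1 and expected times are positive
for row in p.values():
    assert abs(sum(row.values()) - 1.0) < 1e-12
assert all(t > 0 for t in tau.values())
```

                In the inventory example above, i would be the inventory position just after a
                demand, the actions would be "replenish or not", and τ_i(a) would be the mean
                interarrival time of the Poisson demand process.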
                  It is assumed that τ_i(a) > 0 for all i ∈ I and a ∈ A(i). As before, a stationary
                policy R is a rule that assigns to each state i a single action R_i ∈ A(i) and
                prescribes taking this action whenever the system is observed in state i at
                a decision epoch. Since the state space is finite, it can be shown that under each
                stationary policy the number of decisions made in any finite time interval is finite
                with probability 1. We omit the proof of this result. Let

                           X_n = the state of the system at the nth decision epoch.
                Then it follows that under a stationary policy R the embedded stochastic process
                {X_n} is a discrete-time Markov chain with one-step transition probabilities p_ij(R_i).
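                This embedded chain is what makes a stationary policy easy to evaluate
                numerically: by renewal-reward reasoning, the long-run average cost per time
                unit under R is the ratio Σ_j π_j c_j(R_j) / Σ_j π_j τ_j(R_j), where {π_j} is the
                stationary distribution of the chain {X_n}. The sketch below (with hypothetical
                numbers for a two-state model) computes this ratio:

```python
import numpy as np

# Evaluate a fixed stationary policy R for a hypothetical two-state
# semi-Markov decision model: build the embedded transition matrix
# p_ij(R_i), solve for its stationary distribution pi, and form the
# renewal-reward ratio g(R) = (pi . c_R) / (pi . tau_R).

P = np.array([[0.2, 0.8],      # row i holds p_ij(R_i) for the chosen action R_i
              [0.5, 0.5]])
c_R = np.array([7.0, 1.0])     # c_i(R_i): expected cost per transition
tau_R = np.array([1.5, 2.0])   # tau_i(R_i): expected time per transition

# stationary distribution: solve pi P = pi together with sum(pi) = 1
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
pi = np.linalg.lstsq(A, b, rcond=None)[0]

# long-run average cost per time unit under the policy R
g = (pi @ c_R) / (pi @ tau_R)
print("stationary distribution:", pi)
print("average cost per time unit:", g)
```

                For this two-state example the stationary distribution can be checked by hand:
                π_0 = p_10 / (p_01 + p_10) = 0.5/1.3 = 5/13, π_1 = 8/13.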