
together with $v_s = 0$ for some $s$ must have a solution. In view of assertion (b),
this solution is unique. This completes the proof of the theorem.


                Interpretation of the relative values
The equations (6.3.2) are referred to as the value-determination equations. The
relative value function $v_i$, $i \in I$, is unique up to an additive constant. The particular
solution (6.3.1) can be interpreted as the total expected costs incurred until the
first return to state $r$ when policy $R$ is used and the one-step costs are given by
$c_i'(a) = c_i(a) - g$ with $g = g(R)$.
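For a fixed unichain policy $R$, the value-determination equations together with the normalization $v_r = 0$ form a square linear system in the unknowns $g$ and $v_i$, $i \in I$. The following minimal sketch solves this system numerically; the matrix $P$, the cost vector $c$ and the reference state $r$ below are a made-up three-state example, not data from the text.

```python
import numpy as np

# Hypothetical three-state example: P holds the transition probabilities
# p_ij(R_i) and c the one-step costs c_i(R_i) under a fixed policy R.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.1, 0.4, 0.5]])
c = np.array([2.0, 1.0, 3.0])
n = len(c)
r = 0  # reference state, normalized by v_r = 0

# Value-determination equations: g + v_i - sum_j p_ij v_j = c_i for all i,
# plus the extra equation v_r = 0.  Unknowns x = (g, v_0, ..., v_{n-1}).
A = np.zeros((n + 1, n + 1))
A[:n, 0] = 1.0               # coefficient of g in each equation
A[:n, 1:] = np.eye(n) - P    # coefficients of v
A[n, 1 + r] = 1.0            # normalization v_r = 0
b = np.append(c, 0.0)

x = np.linalg.solve(A, b)
g, v = x[0], x[1:]
print("average cost g:", g)
print("relative values v:", v)
```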
If the Markov chain $\{X_n\}$ associated with policy $R$ is aperiodic, two other
interpretations can be given to the relative value function.
The first interpretation is that, for any two states $i, j \in I$,

    $v_i - v_j$ = the difference in total expected costs over an infinitely
                  long period of time by starting in state $i$ rather than in
                  state $j$ when using policy $R$.

In other words, $v_i - v_j$ is the maximum amount that a rational person is willing
to pay to start the system in state $j$ rather than in state $i$ when the system is
controlled by rule $R$. This interpretation is an easy consequence of (6.3.3). Using the
assumption that the Markov chain $\{X_n\}$ is aperiodic, we have that $\lim_{m\to\infty} p_{ij}^{(m)}(R)$
exists. Moreover, this limit is independent of the initial state $i$, since $R$ is unichain.
Thus, by (6.3.3),

    $v_i = \lim_{m\to\infty} \{V_m(i, R) - mg\} + \sum_{j\in I} \pi_j(R)\, v_j.$        (6.3.5)
This implies that $v_i - v_j = \lim_{m\to\infty} \{V_m(i, R) - V_m(j, R)\}$, yielding the above
interpretation.
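This limit is easy to check numerically. The sketch below continues the hypothetical example above (the chain there is aperiodic and unichain): it computes the iterates $V_m(i, R)$ by the recursion $V_m(i, R) = c_i + \sum_j p_{ij} V_{m-1}(j, R)$ with $V_0 \equiv 0$ and compares $V_m(i, R) - V_m(j, R)$ with $v_i - v_j$.

```python
# Continues the hypothetical example; reuses P, c, n, g and v from above.
V = np.zeros(n)          # V_0(i, R) = 0 for all i
for _ in range(200):     # V_m(i, R) = c_i + sum_j p_ij V_{m-1}(j, R)
    V = c + P @ V        # after the loop, V holds V_200
print("V_200(1) - V_200(0):", V[1] - V[0])
print("v_1 - v_0          :", v[1] - v[0])   # the two should nearly agree
```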
A special interpretation applies to the relative value function $v_i$, $i \in I$, with the
property $\sum_{j\in I} \pi_j(R)\, v_j = 0$. Since the relative value function is
unique up to an additive constant, there is a unique relative value function with
this property. Denote this relative value function by $h_i$, $i \in I$. Then, by (6.3.5),

    $h_i = \lim_{m\to\infty} \{V_m(i, R) - mg\}.$        (6.3.6)
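In the hypothetical example, the bias is obtained by shifting $v$ so that $\sum_j \pi_j(R) v_j = 0$, where $\{\pi_j(R)\}$ solves $\pi P = \pi$, $\sum_j \pi_j = 1$; relation (6.3.6) can then be checked against the iterates $V_m$ computed above.

```python
# Equilibrium distribution pi of P: solve pi P = pi with sum(pi) = 1
# as an overdetermined least-squares system (exact for unichain P).
A_pi = np.vstack([P.T - np.eye(n), np.ones(n)])
b_pi = np.append(np.zeros(n), 1.0)
pi = np.linalg.lstsq(A_pi, b_pi, rcond=None)[0]

h = v - pi @ v                     # unique relative values with pi . h = 0
print("bias h        :", h)
print("V_200 - 200*g :", V - 200 * g)   # close to h, as (6.3.6) predicts
```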
The bias $h_i$ can also be interpreted as the difference in total expected costs between
the system whose initial state is $i$ and the system whose initial state is distributed
according to the equilibrium distribution $\{\pi_j(R),\ j \in I\}$ when both systems are
controlled by policy $R$. The latter system is called the stationary system. This
system has the property that at any decision epoch the state is distributed as $\{\pi_j(R)\}$;
see Section 3.3.2. Thus, for the stationary system, the expected cost incurred at any
decision epoch equals $\sum_{j\in I} c_j(R_j)\, \pi_j(R)$, which is the average cost $g = g(R)$ of policy
$R$. Consequently, in the stationary system the total expected costs over the first $m$
decision epochs equal $mg$. This gives the above interpretation of the bias $h_i$.
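In the running hypothetical example this reads as follows: since $\pi P^k = \pi$ for all $k$, the $m$-step expected cost of the stationary system is exactly $\pi \cdot V_m = mg$, and $V_m(i, R) - \pi \cdot V_m$ approaches the bias $h_i$.

```python
# Stationary system: start from pi, so each epoch costs g in expectation
# and the first m = 200 epochs cost m*g in total.
m = 200
V_stationary = pi @ V                  # equals m*g (up to rounding)
print("stationary cost:", V_stationary, " m*g:", m * g)
print("V - stationary :", V - V_stationary)   # close to the bias h
```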