together with $v_s = 0$ for some $s$ must have a solution. In view of assertion (b), this solution is unique. This completes the proof of the theorem.
Interpretation of the relative values
The equations (6.3.2) are referred to as the value-determination equations. The relative value function $v_i$, $i \in I$, is unique up to an additive constant. The particular solution (6.3.1) can be interpreted as the total expected costs incurred until the first return to state $r$ when policy $R$ is used and the one-step costs are given by $c_i'(a) = c_i(a) - g$ with $g = g(R)$.
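To make the value-determination equations concrete, here is a minimal numerical sketch. It assumes the standard form of (6.3.2), namely $g + v_i = c_i(R_i) + \sum_{j\in I} p_{ij}(R_i) v_j$ for $i \in I$; the transition matrix `P`, the cost vector `c`, and the choice $s = 0$ for the normalization $v_s = 0$ are hypothetical illustrations, not data from the text.

```python
import numpy as np

# Hypothetical stand-ins for p_ij(R_i) and c_i(R_i) under a fixed
# unichain (in fact irreducible, aperiodic) policy R.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.0, 0.6]])
c = np.array([1.0, 3.0, 2.0])

n = len(c)
# Value-determination equations: g + v_i = c_i + sum_j p_ij v_j, i in I,
# together with the normalization v_s = 0 for a fixed state s (here s = 0).
# Unknowns: (g, v_0, ..., v_{n-1}).
A = np.zeros((n + 1, n + 1))
b = np.zeros(n + 1)
A[:n, 0] = 1.0               # coefficient of g in each equation
A[:n, 1:] = np.eye(n) - P    # coefficients of v in g*1 + (I - P) v = c
b[:n] = c
A[n, 1] = 1.0                # normalization row: v_0 = 0
sol = np.linalg.solve(A, b)
g, v = sol[0], sol[1:]
print("average cost g  =", g)
print("relative values =", v)
```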
If the Markov chain $\{X_n\}$ associated with policy $R$ is aperiodic, two other interpretations can be given to the relative value function.
The first interpretation is that, for any two states $i, j \in I$,

$v_i - v_j =$ the difference in total expected costs over an infinitely long period of time by starting in state $i$ rather than in state $j$ when using policy $R$.
In other words, $v_i - v_j$ is the maximum amount that a rational person is willing to pay to start the system in state $j$ rather than in state $i$ when the system is controlled by rule $R$. This interpretation is an easy consequence of (6.3.3). Using the assumption that the Markov chain $\{X_n\}$ is aperiodic, we have that $\lim_{m\to\infty} p_{ij}^{(m)}(R)$ exists. Moreover, this limit is independent of the initial state $i$, since $R$ is unichain. Thus, by (6.3.3),

$$v_i = \lim_{m\to\infty} \{V_m(i, R) - mg\} + \sum_{j\in I} \pi_j(R) v_j. \tag{6.3.5}$$
This implies that $v_i - v_j = \lim_{m\to\infty} \{V_m(i, R) - V_m(j, R)\}$, yielding the above interpretation. A special interpretation applies to the relative value function $v_i$, $i \in I$, with the property $\sum_{j\in I} \pi_j(R) v_j = 0$. Since the relative value function is unique up to an additive constant, there is a unique relative value function with this property. Denote this relative value function by $h_i$, $i \in I$. Then, by (6.3.5),

$$h_i = \lim_{m\to\infty} \{V_m(i, R) - mg\}. \tag{6.3.6}$$
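A quick numerical check of (6.3.6), continuing the hypothetical example above: iterate the standard finite-horizon recursion $V_m = c + P V_{m-1}$ (with $V_0 = 0$) and compare $V_m(i, R) - mg$ with the bias $h_i$, obtained by shifting $v$ so that $\sum_{j\in I} \pi_j(R) h_j = 0$. The recursion and the normalization are standard; the numbers are the made-up ones from the earlier sketch.

```python
# Equilibrium distribution pi(R): solve pi (I - P) = 0 with sum(pi) = 1,
# dropping one redundant balance equation.
pi = np.linalg.solve(
    np.vstack([(np.eye(n) - P).T[:-1], np.ones(n)]),
    np.append(np.zeros(n - 1), 1.0))
h = v - pi @ v               # relative values with sum_j pi_j h_j = 0

V = np.zeros(n)
for m in range(1, 2001):
    V = c + P @ V            # V_m(i, R) for all states i at once
print("V_m - m*g :", V - 2000 * g)   # should be close to h, by (6.3.6)
print("bias h    :", h)
```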
The bias $h_i$ can also be interpreted as the difference in total expected costs between the system whose initial state is $i$ and the system whose initial state is distributed according to the equilibrium distribution $\{\pi_j(R), j \in I\}$ when both systems are controlled by policy $R$. The latter system is called the stationary system. This system has the property that at any decision epoch the state is distributed as $\{\pi_j(R)\}$; see Section 3.3.2. Thus, for the stationary system, the expected cost incurred at any decision epoch equals $\sum_{j\in I} c_j(R_j)\pi_j(R)$, which is the average cost $g = g(R)$ of policy $R$. Consequently, in the stationary system the total expected costs over the first $m$ decision epochs equal $mg$. This gives the above interpretation of the bias $h_i$.
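This stationary-system interpretation can also be verified in the toy example: since $\pi(R) P^k = \pi(R)$ for every $k$, starting from $\pi(R)$ gives an expected total cost of exactly $mg$ over $m$ epochs, so $V_m(i, R) - \pi(R) \cdot V_m \to h_i$. A brief sketch, reusing `P`, `c`, `pi` and `g` from the snippets above:

```python
# Stationary system: initial state drawn from pi(R), so the expected
# cost at each epoch is g and the total over m epochs is exactly m*g.
V = np.zeros(n)
m = 2000
for _ in range(m):
    V = c + P @ V            # V_m(i, R), as before
print("pi @ V_m    :", pi @ V)       # equals m*g up to rounding
print("m * g       :", m * g)
print("V_m - pi@V_m:", V - pi @ V)   # approaches the bias h_i
```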