show up in intermediate steps of the algorithms for computing an optimal policy.
However, in most practical applications the Markov chain $\{X_n\}$ is unichain for
each stationary policy.
Policy-improvement idea
A stationary policy $R^*$ is said to be average cost optimal if
$$g_i(R^*) \leq g_i(R)$$
for each stationary policy $R$, uniformly in the initial state $i$. It is stated without
proof that an average cost optimal stationary policy $R^*$ always exists. Moreover,
policy $R^*$ is not only optimal among the class of stationary policies but it is also
optimal among the class of all conceivable policies.
In most applications it is computationally not feasible to find an average cost
optimal policy by computing the average cost for each stationary policy separately.
For example, if the number of states is N and there are two actions in each
state, then the number of possible stationary policies is $2^N$ and this number grows
quickly beyond any practical bound. However, several algorithms can be given
that lead in an effective way to an average cost optimal policy. Policy iteration
and value iteration are the most widely used algorithms to compute an average
cost optimal policy. The first method works on the policy space and generates
a sequence of improved policies, whereas the second method approximates the
minimal average cost through a sequence of value functions. In both methods a
key role is played by the so-called relative values. The relative values are the basis
for a powerful improvement step. The improvement step is motivated through a
heuristic discussion of the relative values of a given policy R. In the next section
a rigorous treatment will be presented for the relative values.
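To fix ideas, the following is a minimal Python sketch of the policy-iteration loop just described, for a unichain problem with a small finite state space. The evaluation step solves the linear equations $g + v_i = c_i(R_i) + \sum_j p_{ij}(R_i) v_j$ (with one relative value normalized to zero) that are developed in the next section, and the improvement step uses the relative values as test quantities. The arrays `c` and `p` and the three-state, two-action data are hypothetical and serve only as an illustration.

```python
import numpy as np

def evaluate_policy(policy, c, p, ref=0):
    """Average cost g and relative values v of a fixed stationary policy
    (unichain case): solve g + v[i] = c[i, a_i] + sum_j p[i, a_i, j] v[j]
    with the normalization v[ref] = 0."""
    n = len(policy)
    A = np.zeros((n, n))
    b = np.zeros(n)
    for i in range(n):
        a = policy[i]
        b[i] = c[i, a]
        A[i, ref] = 1.0                      # this column carries the unknown g
        for j in range(n):
            if j != ref:
                A[i, j] += (1.0 if i == j else 0.0) - p[i, a, j]
    x = np.linalg.solve(A, b)
    g = x[ref]
    v = x.copy()
    v[ref] = 0.0
    return g, v

def improve_policy(policy, v, c, p, tol=1e-9):
    """Switch in each state to an action that strictly lowers the test
    quantity c[i, a] + sum_j p[i, a, j] v[j]; keep the old action on ties."""
    new_policy = policy.copy()
    for i in range(c.shape[0]):
        vals = np.array([c[i, a] + p[i, a] @ v for a in range(c.shape[1])])
        if vals[policy[i]] > vals.min() + tol:
            new_policy[i] = int(np.argmin(vals))
    return new_policy

def policy_iteration(c, p):
    policy = np.zeros(c.shape[0], dtype=int)     # arbitrary starting policy
    while True:
        g, v = evaluate_policy(policy, c, p)
        new_policy = improve_policy(policy, v, c, p)
        if np.array_equal(new_policy, policy):
            return policy, g, v                  # no further improvement possible
        policy = new_policy

# Invented 3-state, 2-action data (every row strictly positive, hence unichain).
c = np.array([[1.0, 3.0], [2.0, 0.5], [4.0, 1.5]])
p = np.array([[[0.5, 0.3, 0.2], [0.1, 0.6, 0.3]],
              [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]],
              [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]]])
policy, g, v = policy_iteration(c, p)
print("policy:", policy, " average cost:", g, " relative values:", v)
```

Keeping the current action in case of ties is a standard device to ensure that the loop stops after finitely many iterations.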
Let us fix any stationary policy $R$. It is assumed that the Markov chain $\{X_n\}$
associated with policy $R$ has no two disjoint closed sets. Then the average cost
$g_i(R) = g(R)$, independently of the initial state $i \in I$. The starting point is the
obvious relation $\lim_{n\to\infty} V_n(i, R)/n = g(R)$ for all $i$, where $V_n(i, R)$ denotes the
total expected costs over the first $n$ decision epochs when the initial state is $i$ and
policy $R$ is used. This relation motivates the heuristic assumption that bias values
$v_i(R)$, $i \in I$, exist such that, for each $i \in I$,
$$V_n(i, R) \approx n g(R) + v_i(R) \quad \text{for $n$ large}. \tag{6.2.8}$$
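As a quick numerical check of (6.2.8), the sketch below computes $V_n(i, R)$ for a fixed policy by the recursion displayed at the end of this section and subtracts $ng(R)$, where $g(R)$ is obtained from the stationary distribution of the chain. The two-state costs and transition probabilities are invented purely for illustration.

```python
import numpy as np

# Hypothetical two-state example under a fixed policy R (numbers invented):
# one-step costs c[i] and transition probabilities p[i, j].
c = np.array([1.0, 4.0])
p = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Total expected costs over the first n epochs, via the recursion
# V_n(i) = c_i + sum_j p_ij V_{n-1}(j), starting from V_0 = 0.
N = 200
V = np.zeros(2)
for _ in range(N):
    V = c + p @ V

# Average cost g(R) = sum_i pi_i c_i, with pi the stationary distribution of p.
eigvals, eigvecs = np.linalg.eig(p.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()
g = pi @ c

print("g(R) =", g)
print("V_N(i) - N*g(R) =", V - N * g)   # settles down to the bias values v_i(R)
```

Increasing `N` leaves the printed differences essentially unchanged, which is precisely the content of (6.2.8).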
Note that $v_i(R) - v_j(R) \approx V_n(i, R) - V_n(j, R)$ for $n$ large. Thus $v_i(R) - v_j(R)$
measures the difference in total expected costs when starting in state $i$ rather than
in state $j$, given that policy $R$ is followed. This explains the name of relative values
for the $v_i(R)$. We have the recursion equation
$$V_n(i, R) = c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\, V_{n-1}(j, R), \qquad n \geq 1 \text{ and } i \in I$$