
                show up in intermediate steps of the algorithms for computing an optimal policy.
                However, in most practical applications the Markov chain {X_n} is unichain for
                each stationary policy.

                Policy-improvement idea

                A stationary policy R* is said to be average cost optimal if

                                           g_i(R*) ≤ g_i(R)

                for each stationary policy R, uniformly in the initial state i. It is stated without
                proof that an average cost optimal stationary policy R* always exists. Moreover,
                policy R* is not only optimal among the class of stationary policies but it is also
                optimal among the class of all conceivable policies.
                  In most applications it is computationally not feasible to find an average cost
                optimal policy by computing the average cost for each stationary policy separately.
                For example, if the number of states is N and there are two actions in each
                state, then the number of possible stationary policies is 2^N and this number grows
                quickly beyond any practical bound. However, several algorithms can be given
                that lead in an effective way to an average cost optimal policy. Policy iteration
                and value iteration are the most widely used algorithms to compute an average
                cost optimal policy. The first method works on the policy space and generates
                a sequence of improved policies, whereas the second method approximates the
                minimal average cost through a sequence of value functions. In both methods a
                key role is played by the so-called relative values. The relative values are the basis
                for a powerful improvement step. The improvement step is motivated through a
                heuristic discussion of the relative values of a given policy R. In the next section
                a rigorous treatment will be presented for the relative values.
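                  To give a concrete impression of how the relative values drive the improvement
                step, the following sketch runs a policy-iteration cycle on a small unichain model
                (Python with NumPy; the cost array c[i, a] and transition array P[i, a, j] are
                invented purely for illustration). Each cycle solves for the average cost g(R) and
                the relative values v_i(R) of the current policy, and then picks in every state an
                action minimizing the one-step cost plus the expected relative value of the next
                state; the equations solved here are the ones treated rigorously in the sections
                that follow, so the sketch only anticipates them.

# Minimal sketch of average-cost policy iteration for a unichain model.
# The data below (costs c[i, a], transition probabilities P[i, a, j]) are hypothetical.
import numpy as np

c = np.array([[1.0, 2.0],          # c_i(a): one-step cost in state i under action a
              [3.0, 0.5]])
P = np.array([[[0.7, 0.3],         # P[i, a, j] = p_ij(a); each row sums to 1
               [0.2, 0.8]],
              [[0.4, 0.6],
               [0.9, 0.1]]])
N = c.shape[0]

R = np.zeros(N, dtype=int)         # initial stationary policy R
while True:
    # Value determination: solve g + v_i = c_i(R_i) + sum_j p_ij(R_i) v_j with v_{N-1} = 0.
    Pr = P[np.arange(N), R]                            # p_ij(R_i) for the current policy
    A = np.hstack((np.ones((N, 1)), (np.eye(N) - Pr)[:, :-1]))
    sol = np.linalg.solve(A, c[np.arange(N), R])       # unknowns: g, v_0, ..., v_{N-2}
    g, v = sol[0], np.append(sol[1:], 0.0)

    # Policy improvement: in each state choose an action minimizing c_i(a) + sum_j p_ij(a) v_j.
    R_new = np.argmin(c + P @ v, axis=1)
    if np.array_equal(R_new, R):
        break                                          # no further improvement: stop
    R = R_new

print("policy:", R, " average cost:", g)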
                  Let us fix any stationary policy R. It is assumed that the Markov chain {X_n}
                associated with policy R has no two disjoint closed sets. Then the average cost
                g_i(R) = g(R), independently of the initial state i ∈ I. The starting point is the
                obvious relation lim_{n→∞} V_n(i, R)/n = g(R) for all i, where V_n(i, R) denotes
                the total expected costs over the first n decision epochs when the initial state is
                i and policy R is used. This relation motivates the heuristic assumption that bias
                values v_i(R), i ∈ I, exist such that, for each i ∈ I,

                                 V_n(i, R) ≈ n g(R) + v_i(R)   for n large.        (6.2.8)

                Note that v_i(R) − v_j(R) ≈ V_n(i, R) − V_n(j, R) for n large. Thus v_i(R) − v_j(R)
                measures the difference in total expected costs when starting in state i rather than
                in state j, given that policy R is followed. This explains the name of relative
                values for the v_i(R). We have the recursion equation


                        V_n(i, R) = c_i(R_i) + Σ_{j∈I} p_ij(R_i) V_{n−1}(j, R),    n ≥ 1 and i ∈ I
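
                  To see the heuristic (6.2.8) at work, this recursion can be iterated numerically
                for a fixed policy. The sketch below is written in Python with NumPy; the two-state
                costs c_i(R_i) and transition probabilities p_ij(R_i) are invented for illustration,
                and the average cost g(R) is obtained from the standard unichain expression
                g(R) = Σ_j π_j c_j(R_j) with π the stationary distribution of the chain. The
                differences V_n(i, R) − n g(R) are then seen to settle down to constants, in line
                with the interpretation of the relative values v_i(R).

# Minimal sketch: iterate V_n(i,R) = c_i(R_i) + sum_j p_ij(R_i) V_{n-1}(j,R)
# for one fixed stationary policy R; all numbers below are purely illustrative.
import numpy as np

c = np.array([1.0, 3.0])                 # one-step costs c_i(R_i) under policy R
P = np.array([[0.7, 0.3],                # transition probabilities p_ij(R_i)
              [0.4, 0.6]])

# Average cost g(R) = sum_j pi_j c_j(R_j), with pi the stationary distribution of P.
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = pi / pi.sum()
g = pi @ c

V = np.zeros(2)                          # V_0(i, R) = 0: no costs over zero epochs
for n in range(1, 201):
    V = c + P @ V                        # the recursion for V_n(i, R)

print("V_n(i,R) - n*g(R):", V - 200 * g)   # settles down to the relative values v_i(R)
print("v_0(R) - v_1(R)  :", V[0] - V[1])   # total-cost difference: start in 0 vs. in 1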