Page 267 - A First Course In Stochastic Models

DISCRETE-TIME MARKOV DECISION PROCESSES

Intuitively, one might expect that the one-step difference V_n(i) − V_{n−1}(i) will come very close to the minimal average cost per time unit, and that the stationary policy whose actions minimize the right-hand side of (6.6.1) for all i will be very close in cost to the minimal average cost. However, these matters turn out to be rather subtle for the average cost criterion, owing to possible periodicities in the underlying decision processes. Before explaining this in more detail, we first investigate
an operator which is induced by the recursion equation (6.6.1). The operator T assigns to each function v = (v_i, i ∈ I) a function Tv whose ith component (Tv)_i is defined by
\[
(Tv)_i = \min_{a \in A(i)} \Big\{ c_i(a) + \sum_{j \in I} p_{ij}(a)\, v_j \Big\}, \qquad i \in I. \tag{6.6.2}
\]
Note that (Tv)_i = V_n(i) if v_i = V_{n−1}(i), i ∈ I. The following theorem plays a key role in the value-iteration algorithm.
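In code, one application of the operator T amounts to a single minimization per state. The following Python sketch is ours, not the book's, and assumes a hypothetical data layout: costs[i] is the vector of costs c_i(a) over the actions a ∈ A(i), and probs[i] is a matrix whose row a holds the transition probabilities p_ij(a) over all states j.

```python
import numpy as np

def apply_T(v, costs, probs):
    """One application of the operator T from (6.6.2).

    For each state i, compute
        (Tv)_i = min over a in A(i) of  c_i(a) + sum_j p_ij(a) v_j
    and record a minimizing action, i.e. the policy R(v).

    The data layout (costs[i]: 1-D cost vector over A(i);
    probs[i]: 2-D array whose row a holds p_ij(a)) is an
    illustrative assumption, not the book's notation.
    """
    n = len(v)
    Tv = np.empty(n)
    policy = np.empty(n, dtype=int)
    for i in range(n):
        q = costs[i] + probs[i] @ v      # c_i(a) + sum_j p_ij(a) v_j, one entry per action
        policy[i] = int(np.argmin(q))    # a minimizing action R_i(v)
        Tv[i] = q[policy[i]]
    return Tv, policy
```

Iterating v ← Tv starting from v_i = V_{n−1}(i) then reproduces the value-iteration recursion V_n(i) = (Tv)_i.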

Theorem 6.6.1 Suppose that the weak unichain assumption is satisfied. Let v = (v_i) be given. Define the stationary policy R(v) as a policy which assigns to each state i ∈ I an action a = R_i(v) that minimizes the right-hand side of (6.6.2). Then

                                            ∗
                           min {(T v) i − v i } ≤ g ≤ g s (R(v)) ≤ max {(T v) i − v i }  (6.6.3)
                           i∈I                            i∈I
                                   ∗
                for any s ∈ I, where g is the minimal long-run average cost per time unit and
                g s (R(v)) denotes the long-run average cost per time unit under policy R(v) when
                the initial state is s.
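The bounds (6.6.3) suggest a practical stopping rule for value iteration: iterate v ← Tv and halt once min_i{(Tv)_i − v_i} and max_i{(Tv)_i − v_i} are sufficiently close, since both bracket g^*. A minimal Python sketch of this scheme, under a hypothetical data layout in which costs[i] is the vector of costs c_i(a) over A(i) and probs[i] a matrix whose row a holds p_ij(a); as noted at the start of this section, periodicities may keep the bounds from converging, so in practice one would first make the chains aperiodic (e.g. by a damping transformation).

```python
import numpy as np

def value_iteration(costs, probs, tol=1e-8, max_iter=100_000):
    """Value iteration with the bounds of (6.6.3).

    Repeatedly apply the operator T of (6.6.2); at each step
        m = min_i {(Tv)_i - v_i}  and  M = max_i {(Tv)_i - v_i}
    bracket the minimal average cost g*, so we stop once
    M - m <= tol and return the midpoint together with the
    policy R(v).  The data layout (costs[i], probs[i]) is an
    illustrative assumption, not the book's notation.
    """
    n = len(costs)
    v = np.zeros(n)
    for _ in range(max_iter):
        Tv = np.empty(n)
        policy = np.empty(n, dtype=int)
        for i in range(n):
            q = costs[i] + probs[i] @ v          # c_i(a) + sum_j p_ij(a) v_j
            policy[i] = int(np.argmin(q))
            Tv[i] = q[policy[i]]
        m, M = np.min(Tv - v), np.max(Tv - v)    # lower/upper bounds on g*
        if M - m <= tol:
            return 0.5 * (m + M), policy
        v = Tv
    raise RuntimeError("bounds did not converge; the chains may be periodic")
```

On a two-state example with one aperiodic recurrent class, the gap M − m shrinks geometrically and the returned midpoint pins down g^* to within tol/2.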
Proof  To prove the first inequality, choose any stationary policy R. By the definition of (Tv)_i, we have for any state i ∈ I that

\[
(Tv)_i \le c_i(a) + \sum_{j \in I} p_{ij}(a)\, v_j, \qquad a \in A(i), \tag{6.6.4}
\]
where equality holds for a = R_i(v). Choosing a = R_i in (6.6.4) gives

\[
(Tv)_i \le c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\, v_j, \qquad i \in I. \tag{6.6.5}
\]
                Define the lower bound

\[
m = \min_{i \in I}\{(Tv)_i - v_i\}.
\]
Since m ≤ (Tv)_i − v_i for all i, it follows from (6.6.5) that
\[
m + v_i \le c_i(R_i) + \sum_{j \in I} p_{ij}(R_i)\, v_j \quad \text{for all } i \in I,
\]
and so
\[
c_i(R_i) - m + \sum_{j \in I} p_{ij}(R_i)\, v_j \ge v_i, \qquad i \in I.
\]