Page 261 - A First Course In Stochastic Models
254 DISCRETE-TIME MARKOV DECISION PROCESSES
(6.5.3). By the duality theorem of linear programming, the primal and dual linear
programs have the same optimal objective-function value. Hence the minimal
objective-function value of the linear program (6.5.1) equals the minimal average
cost $g^*$. Next we show that an optimal basic solution $(x_{ia}^*)$ to the linear
program (6.5.1) induces an average cost optimal policy. To do so, define the set
$$S_0 = \Big\{\, i \;:\; \sum_{a\in A(i)} x_{ia}^* > 0 \,\Big\}.$$
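The set $S_0$ can be read directly off a solved linear program: collect the states whose variables carry positive mass, and in each such state pick any action with $x_{ia}^* > 0$. A minimal Python sketch with a made-up solution array (the states, actions, and values are illustrative, not taken from the book):

```python
# Hypothetical optimal basic solution x*[i][a] of LP (6.5.1) for a
# 3-state MDP; the numbers are illustrative only.
x_star = {
    0: {0: 0.0, 1: 0.25},   # state 0: action 1 carries positive mass
    1: {0: 0.75},           # state 1: action 0 carries positive mass
    2: {0: 0.0, 1: 0.0},    # state 2: no positive mass -> not in S0
}

EPS = 1e-9  # numerical tolerance for "strictly positive"

# S0 = { i : sum_{a in A(i)} x*_{ia} > 0 }
S0 = {i for i, xa in x_star.items() if sum(xa.values()) > EPS}

# For i in S0, any action with x*_{ia} > 0 may serve as R(i).
policy = {i: max(x_star[i], key=x_star[i].get) for i in S0}

print(S0)      # {0, 1}
print(policy)  # {0: 1, 1: 0}
```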
Then the set $S_0$ is closed under any policy $R$ having the property that the
action $a = R_i$ satisfies $x_{ia}^* > 0$ for each $i \in S_0$. To see this, suppose
that $p_{ij}(R_i) > 0$ for some $i \in S_0$ and $j \notin S_0$. Then the first set of
constraints of the linear program (6.5.1) implies that $\sum_{a\in A(j)} x_{ja}^* > 0$,
contradicting $j \notin S_0$. Next consider the set $I_0$ as
constructed in the linear programming algorithm. Let $R^*$ be a policy such that
the actions $R_i^*$ for $i \in I_0$ are chosen according to the algorithm. It remains
to verify that $I_0 = I$ and that policy $R^*$ is average cost optimal. To do so,
let $\{g^*, v_i^*\}$ be
the particular optimal basic solution to the primal linear program (6.5.3) such that
this basic solution is complementary to the optimal basic solution $(x_{ia}^*)$ of the
dual linear program (6.5.1). Then, by the complementary slackness property of
linear programming,
$$g^* + v_i^* - \sum_{j\in I} p_{ij}(R_i^*)\, v_j^* = c_i(R_i^*), \qquad i \in S_0.$$
The term $\sum_{j\in I} p_{ij}(R_i^*)\, v_j^*$ can be replaced by
$\sum_{j\in S_0} p_{ij}(R_i^*)\, v_j^*$ for $i \in S_0$, since the set $S_0$ is closed
under policy $R^*$. Thus, by Theorem 6.2.1, we can conclude that $g_i(R^*) = g^*$
for all $i \in S_0$. The states in $I_0 \setminus S_0$ are transient under policy $R^*$
and ultimately lead to a state in $S_0$. Hence $g_i(R^*) = g^*$ for all $i \in I_0$. To
prove that $I_0 = I$, assume to the contrary that $I_0 \neq I$. By the construction of $I_0$,
the set $I \setminus I_0$ is closed under any policy. Let $R_0$ be any average cost optimal policy.
Define the policy $R_1$ by
$$R_1(i) = \begin{cases} R^*(i), & i \in I_0,\\ R_0(i), & i \in I \setminus I_0. \end{cases}$$
Since $I \setminus I_0$ and $I_0$ are both closed sets under policy $R_1$, we have
constructed an average cost optimal policy with two disjoint closed sets. This
contradicts the weak unichain assumption. Hence $I_0 = I$. This completes the proof.
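The equivalence established in the proof can be checked by brute force on a tiny instance: every stationary deterministic policy $R$ with stationary distribution $\{\pi_i\}$ induces a feasible point $x_{ia} = \pi_i \mathbf{1}\{a = R_i\}$ of LP (6.5.1) whose objective value is the average cost of $R$, so the minimum over policies matches the LP optimum. A hedged Python sketch for a made-up two-state MDP satisfying the weak unichain assumption (not the data of Example 6.1.1):

```python
from itertools import product

# Illustrative 2-state MDP (not from the book).
# c[i][a] : one-step cost in state i under action a
# p[i][a] : transition probabilities {j: p_ij(a)}
c = {0: {0: 5.0, 1: 10.0}, 1: {0: 0.0}}
p = {0: {0: {0: 1.0}, 1: {1: 1.0}}, 1: {0: {0: 0.1, 1: 0.9}}}
states = [0, 1]

def stationary(policy):
    """Stationary distribution of the chain induced by a policy
    (power iteration; adequate for this tiny aperiodic example)."""
    pi = {i: 1.0 / len(states) for i in states}
    for _ in range(2000):
        pi = {j: sum(pi[i] * p[i][policy[i]].get(j, 0.0) for i in states)
              for j in states}
    return pi

best = None
for choice in product(*(sorted(c[i]) for i in states)):
    R = dict(zip(states, choice))
    pi = stationary(R)
    # x_{ia} = pi_i * 1{a = R_i} is feasible for LP (6.5.1);
    # its objective value is the average cost of policy R.
    avg = sum(pi[i] * c[i][R[i]] for i in states)
    if best is None or avg < best[0]:
        best = (avg, R)

g_star, R_star = best
print(g_star, R_star)  # minimal average cost and an optimal policy
```

Here the minimal average cost is $10/11$, attained by leaving state 0 immediately despite the higher one-step cost.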
We now illustrate the linear programming formulation with the Markov decision
problem from Example 6.1.1. The specification of the basic elements of the Markov
decision model for this problem is given in Section 6.1.
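The LP data for a formulation of the form (6.5.1) can be assembled mechanically: one balance constraint per state plus the normalization $\sum_{i,a} x_{ia} = 1$. A sketch building the arrays in the shape a solver such as `scipy.optimize.linprog` expects, with illustrative two-state data rather than the numbers of Example 6.1.1:

```python
# Assemble LP (6.5.1) in equality form:
#   minimize  sum_{i,a} c_i(a) x_{ia}
#   s.t.      sum_a x_{ja} - sum_{i,a} p_ij(a) x_{ia} = 0   for all j,
#             sum_{i,a} x_{ia} = 1,   x >= 0.
# Illustrative 2-state data (not Example 6.1.1).
c = {0: {0: 5.0, 1: 10.0}, 1: {0: 0.0}}
p = {0: {0: {0: 1.0}, 1: {1: 1.0}}, 1: {0: {0: 0.1, 1: 0.9}}}
states = sorted(c)
cols = [(i, a) for i in states for a in sorted(c[i])]  # one column per (i, a)

obj = [c[i][a] for (i, a) in cols]
A_eq = []
for j in states:  # balance constraint for state j
    A_eq.append([(1.0 if i == j else 0.0) - p[i][a].get(j, 0.0)
                 for (i, a) in cols])
A_eq.append([1.0] * len(cols))        # normalization row
b_eq = [0.0] * len(states) + [1.0]

# obj, A_eq, b_eq (with bounds x >= 0) can be passed to
# scipy.optimize.linprog; the solution would give x*_{ia}.
print(len(cols), len(A_eq))  # 3 columns, 3 constraint rows
```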