
Denote by a_ℓ = e^{-µD}(µD)^ℓ/ℓ! the probability of the completion of ℓ service
phases during an interarrival time D when the server is continuously busy. Then
the recursive value-iteration equation (6.6.1) becomes

\[
V_n(i) = \sum_{\ell=0}^{i+r-1} a_\ell V_{n-1}(i+r-\ell)
         + \Bigl(1 - \sum_{\ell=0}^{i+r-1} a_\ell\Bigr) V_{n-1}(0),
         \qquad 0 \le i \le Kr - r,
\]
\[
V_n(i) = 1 + \sum_{\ell=0}^{i-1} a_\ell V_{n-1}(i-\ell)
         + \Bigl(1 - \sum_{\ell=0}^{i-1} a_\ell\Bigr) V_{n-1}(0),
         \qquad Kr - r < i \le Kr.
\]
The discrete-time Markov chain describing the number of service phases present at
the arrival epochs is aperiodic. Hence the lower and upper bounds m_n and M_n from
the value-iteration algorithm both converge to the long-run fraction of customers
who are lost.
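
As an illustration of how these recursions are used, the following sketch (in Python; the function name loss_fraction and its parameter names are illustrative, not from the text) iterates V_n starting from V_0 ≡ 0 and stops once the successive-difference bounds m_n = min_i [V_n(i) − V_{n−1}(i)] and M_n = max_i [V_n(i) − V_{n−1}(i)], the usual value-iteration bounds, have nearly coincided; by the aperiodicity just noted, both bounds converge to the long-run fraction of lost customers.

```python
import math

def loss_fraction(mu, D, r, K, tol=1e-10, max_iter=100_000):
    """Minimal sketch of the value-iteration recursion above.

    States i = 0, ..., K*r count the uncompleted service phases seen at an
    arrival epoch; an accepted customer adds r exponential(mu) phases, and
    a_l = exp(-mu*D) (mu*D)**l / l! is the probability that l phases are
    completed during the constant interarrival time D while the server is
    continuously busy.  (Function and parameter names are illustrative.)
    """
    n_states = K * r + 1
    # Poisson probabilities a_l, computed recursively to avoid factorials.
    a = [math.exp(-mu * D)]
    for l in range(1, n_states):
        a.append(a[-1] * mu * D / l)
    cum = [a[0]]
    for l in range(1, n_states):
        cum.append(cum[-1] + a[l])              # cum[l] = a_0 + ... + a_l

    V_prev = [0.0] * n_states                   # V_0 identically zero
    for _ in range(max_iter):
        V = [0.0] * n_states
        for i in range(n_states):
            if i <= K * r - r:                  # arriving customer accepted
                hi = i + r - 1
                V[i] = sum(a[l] * V_prev[i + r - l] for l in range(hi + 1))
                V[i] += (1.0 - cum[hi]) * V_prev[0]
            else:                               # arriving customer lost: cost 1
                hi = i - 1
                V[i] = 1.0 + sum(a[l] * V_prev[i - l] for l in range(hi + 1))
                V[i] += (1.0 - cum[hi]) * V_prev[0]
        diffs = [V[i] - V_prev[i] for i in range(n_states)]
        m_n, M_n = min(diffs), max(diffs)       # lower and upper bounds
        if M_n - m_n < tol:
            return 0.5 * (m_n + M_n)            # both bounds -> loss fraction
        V_prev = V
    return 0.5 * (m_n + M_n)
```

Starting from V_0 ≡ 0 is a standard initialization; any other starting function would lead to the same limit of the bounds.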



                                 6.7  CONVERGENCE PROOFS

In this section we give convergence proofs for the policy-iteration algorithm and
the value-iteration algorithm. The finite convergence of the policy-iteration
algorithm is proved for the unichain case. For the standard value-iteration
algorithm the convergence of the bounds m_n and M_n to the same limit is proved
under the unichain assumption together with the assumption that the one-step
transition probability p_ii(a) > 0 for all i ∈ I and a ∈ A(i). The latter
aperiodicity assumption is automatically satisfied when the data transformation
discussed in Section 6.6 is applied.
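
For concreteness, one standard way to enforce p_ii(a) > 0 is the perturbation p̃_ij(a) = τ p_ij(a) + (1 − τ)δ_ij with 0 < τ < 1 and costs left unchanged; that this coincides with the transformation of Section 6.6 is assumed here rather than quoted. Each stationary policy keeps its stationary distribution and hence its average cost, while every state acquires a positive self-transition probability. A minimal sketch (the data layout is an illustrative choice):

```python
def make_aperiodic(P, tau=0.5):
    """Aperiodicity perturbation, sketched under the assumption that it
    matches the Section 6.6 data transformation:
        p~_ij(a) = tau * p_ij(a) + (1 - tau) * [j == i],   0 < tau < 1.
    P maps (state, action) to a dict {next_state: probability}; one-step
    costs are not touched, so the average cost of every stationary policy
    is preserved, while p~_ii(a) >= 1 - tau > 0 guarantees aperiodicity.
    """
    P_new = {}
    for (i, a), row in P.items():
        new_row = {j: tau * p for j, p in row.items()}
        new_row[i] = new_row.get(i, 0.0) + (1.0 - tau)
        P_new[(i, a)] = new_row
    return P_new
```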


                Convergence proof for policy iteration
We first establish a lexicographical ordering for the average cost and the relative
values associated with the policies that are generated by the algorithm. For that
purpose we need to standardize the relative value functions, since a relative value
function is not uniquely determined. Let us number or renumber the possible states
as i = 1, ..., N. In view of the fact that the relative values of a given policy
are unique up to an additive constant, the sequence of policies generated by the
algorithm does not depend on the particular choice of the relative value function
for a given policy. For each stationary policy Q, we now consider the particular
relative value function w_i(Q) defined by (6.3.1), where the regeneration state r is
chosen as the largest state in I(Q). The set I(Q) is defined by

         I(Q) = the set of states that are recurrent under policy Q.

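To make this standardization concrete, here is a small numerical sketch (Python with NumPy; the function name and the thresholding of the stationary distribution are illustrative assumptions). It identifies I(Q) as the set of states carrying positive stationary probability under the fixed policy, takes r as the largest such state, and obtains the pair (g(Q), w(Q)) from the average-cost evaluation equations normalized by w_r(Q) = 0; that this normalization reproduces the function defined by (6.3.1) is assumed rather than quoted from the text.

```python
import numpy as np

def standardized_relative_values(P, c):
    """Sketch: standardized relative values for a fixed stationary policy Q
    in the unichain case.

    P[i][j] = p_ij(Q) and c[i] = c_i(Q); the states 1, ..., N of the text
    are stored at indices 0, ..., N-1.  The regeneration state r is the
    largest recurrent state, as in the text, and (g, w) solves
        g + w_i = c_i + sum_j p_ij * w_j,   with  w_r = 0,
    which is assumed to give the same standardization as (6.3.1).
    """
    P = np.asarray(P, dtype=float)
    c = np.asarray(c, dtype=float)
    N = len(c)

    # I(Q): in the unichain case the recurrent states are exactly those
    # with positive stationary probability under P.
    A_pi = np.vstack([P.T - np.eye(N), np.ones(N)])
    pi = np.linalg.lstsq(A_pi, np.append(np.zeros(N), 1.0), rcond=None)[0]
    r = max(j for j in range(N) if pi[j] > 1e-12)   # largest state in I(Q)

    # Unknowns (g, w_1, ..., w_N); the last row enforces w_r = 0.
    A = np.zeros((N + 1, N + 1))
    A[:N, 0] = 1.0
    A[:N, 1:] = np.eye(N) - P
    A[N, 1 + r] = 1.0
    x = np.linalg.solve(A, np.append(c, 0.0))
    return x[0], x[1:], r                           # g(Q), w(Q), regeneration state
```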
Let R and R̄ be immediate successors in the sequence of policies generated by the
algorithm. Suppose that R̄ ≠ R. We assert that either