Page 319 - Applied Probability

P. 319

14. Poisson Approximation
308
2
2
λ (m+1)
1
.
It follows that b 1
n
n
Evaluation of the constant b 2 is more diﬃcult. Consider β ∈ B α such
that |β ∩ α| = k for 0 <k <m + 1. Let U (1) and U (k) be the positions of
the ﬁrst and last of the k common points shared by β and α. If we condition
on the values U (1) = u (1) and U (k) = u (k) , then the indicator variables X α
and X β are independent and identically distributed. Hence,
Pr(X α =1,X β =1 | U (1) = u (1) ,U (k) = u (k) )
2
=Pr(X α =1 | U (1) = u (1) ,U (k) = u (k) ) .
In order that X α = 1, the m+1−k remaining points in α must be within a
distance s to the left of u (1) and a distance s to the right of u (k) . It follows
that Pr(X α =1 | U (1) = u (1) ,U (k) = u (k) ) ≤ (3s) m+1−k for u (k) − u (1) ≤ s.
This uniform bound yields a crude upper bound on p αβ if combined with
the probability that U (k) − U (1) ≤ s. This latter probability is
Pr(U (k) − U (1) ≤ s)= ks k−1 − (k − 1)s k
for exactly the same reasons that produced equality (14.5). Thus, we can
k
assert that p αβ ≤ q k =[ks k−1 −(k−1)s ](3s) 2[m+1−k] whenever |β∩α| = k.
m+1 n−m−1
Since there are such collections β for every α,
k m+1−k
m
n m +1 n − m − 1
b 2 ≤ q k . (14.6)
m +1 k m +1 − k
k=1
The upper bound (14.6) is hard to evaluate explicitly, but its dominant
contribution occurs when k = m. Indeed, because of the asymptotic rela-
tions n m+1 m 1 and q k s 2m−k+1 , the kth term of the sum satisﬁes
s
n m +1 n − m − 1 2m−k+2 2m−k+1

q k n s
m +1 k m +1 − k
(2m−k+1)(m+1)
n 2m−k+2 − m
n
k−m−1
= n m .
Thus, the dominant term of the sum is of order n −1/m , and for suﬃciently
large n and suﬃciently small s, the Poisson approximation Pr(S =0) ≈ e −λ
applies. The slow rate O(n −1/m ) of convergence of the total variation dis-
tance to 0 is rather disappointing in this example. The theoretical argu-
ments and numerical evidence presented by Glaz [9] and Roos [16] suggest
that a compound Poisson approximation performs better.
14.7 DNA Sequence Matching

A basic problem in DNA sequence analysis is to test whether two diﬀerent
sequences share signiﬁcant similarities. Strong similarity or homology often

314 315 316 317 318 319 320 321 322 323 324