Page 319 - Applied Probability
P. 319
14. Poisson Approximation
308
2
2
λ (m+1)
1
.
It follows that b 1
n
n
Evaluation of the constant b 2 is more difficult. Consider β ∈ B α such
that |β ∩ α| = k for 0 <k <m + 1. Let U (1) and U (k) be the positions of
the first and last of the k common points shared by β and α. If we condition
on the values U (1) = u (1) and U (k) = u (k) , then the indicator variables X α
and X β are independent and identically distributed. Hence,
Pr(X α =1,X β =1 | U (1) = u (1) ,U (k) = u (k) )
2
=Pr(X α =1 | U (1) = u (1) ,U (k) = u (k) ) .
In order that X α = 1, the m+1−k remaining points in α must be within a
distance s to the left of u (1) and a distance s to the right of u (k) . It follows
that Pr(X α =1 | U (1) = u (1) ,U (k) = u (k) ) ≤ (3s) m+1−k for u (k) − u (1) ≤ s.
This uniform bound yields a crude upper bound on p αβ if combined with
the probability that U (k) − U (1) ≤ s. This latter probability is
Pr(U (k) − U (1) ≤ s)= ks k−1 − (k − 1)s k
for exactly the same reasons that produced equality (14.5). Thus, we can
k
assert that p αβ ≤ q k =[ks k−1 −(k−1)s ](3s) 2[m+1−k] whenever |β∩α| = k.
m+1 n−m−1
Since there are such collections β for every α,
k m+1−k
m
n m +1 n − m − 1
b 2 ≤ q k . (14.6)
m +1 k m +1 − k
k=1
The upper bound (14.6) is hard to evaluate explicitly, but its dominant
contribution occurs when k = m. Indeed, because of the asymptotic rela-
tions n m+1 m 1 and q k s 2m−k+1 , the kth term of the sum satisfies
s
n m +1 n − m − 1 2m−k+2 2m−k+1
q k n s
m +1 k m +1 − k
(2m−k+1)(m+1)
n 2m−k+2 − m
n
k−m−1
= n m .
Thus, the dominant term of the sum is of order n −1/m , and for sufficiently
large n and sufficiently small s, the Poisson approximation Pr(S =0) ≈ e −λ
applies. The slow rate O(n −1/m ) of convergence of the total variation dis-
tance to 0 is rather disappointing in this example. The theoretical argu-
ments and numerical evidence presented by Glaz [9] and Roos [16] suggest
that a compound Poisson approximation performs better.
14.7 DNA Sequence Matching
A basic problem in DNA sequence analysis is to test whether two different
sequences share significant similarities. Strong similarity or homology often