Page 320 - Applied Probability

P. 320

309
14. Poisson Approximation
indicates a common evolutionary origin of the two sequences. In many cases
homology also indicates a common biochemical or structural function of the
genes encoded by the sequences.
To pose the problem of sequence comparison statistically, consider two
sequences with m and n bases, respectively. Now imagine sliding the ﬁrst
sequence along the second sequence. For some alignment the two sequences
share a region attaining the longest perfect match. (Here the match is local,
and no internal indels are permitted in either sequence.) Since ties can
occur, there may be several such regions. Let M mn be the random number
of base pairs involved in a longest perfect match. Under the null hypothesis
that the two sequences are unrelated, one can compute the approximate
distribution of M mn . If the observed value of M mn is inordinately large
according to this distribution, then signiﬁcant homology can be claimed.
Computing the distribution of M mn is subtle. Fortunately, the Chen-
Stein method is applicable. Assume ﬁrst that the bases appearing at the
various positions of either sequence are chosen independently from the set of
nucleotides {A,C,T,G} with probabilities q A , q C , q T , and q G , respectively.
The probability of a match between any two positions is
2
2
2
2
q = q + q + q + q .
A C T G
Deﬁne W ij to be the indicator random variable for the event of a match
between position i of the ﬁrst sequence and position j of the second se-
quence. The indicator random variable X ij of the event that a perfect
match of length t or longer begins at positions i and j of the two sequences
is given by
t−1

X ij =(1 − W i−1,j−1 ) W i+k,j+k .
k=0
This expression for X ij ignores end eﬀects. For m and n large compared
to t, end eﬀects will be trivial. Alternatively, imagine the two sequences
wrapped into circles, and interpret the subscript arithmetic involved in
deﬁning X ij as modulo m and n.
According to the Chen-Stein approximation, the distribution function of
the longest match M mn satisﬁes

Pr(M mn <t)=Pr( X ij =0)
(i,j)
≈ e −λ ,

where λ = Pr(X ij = 1). There are mn pairs (i, j) and each has
(i,j)
t
probability Pr(X ij =1)= (1 − q)q of initiating a perfect match. It follows
t
that λ = mn(1 − q)q . Evaluating the error bound (14.3) for the Poisson
approximation is possible, but too complicated to present here. See [20] for
the full treatment.

315 316 317 318 319 320 321 322 323 324 325