Page 320 - Applied Probability
P. 320
309
14. Poisson Approximation
indicates a common evolutionary origin of the two sequences. In many cases
homology also indicates a common biochemical or structural function of the
genes encoded by the sequences.
To pose the problem of sequence comparison statistically, consider two
sequences with m and n bases, respectively. Now imagine sliding the first
sequence along the second sequence. For some alignment the two sequences
share a region attaining the longest perfect match. (Here the match is local,
and no internal indels are permitted in either sequence.) Since ties can
occur, there may be several such regions. Let M mn be the random number
of base pairs involved in a longest perfect match. Under the null hypothesis
that the two sequences are unrelated, one can compute the approximate
distribution of M mn . If the observed value of M mn is inordinately large
according to this distribution, then significant homology can be claimed.
Computing the distribution of M mn is subtle. Fortunately, the Chen-
Stein method is applicable. Assume first that the bases appearing at the
various positions of either sequence are chosen independently from the set of
nucleotides {A,C,T,G} with probabilities q A , q C , q T , and q G , respectively.
The probability of a match between any two positions is
2
2
2
2
q = q + q + q + q .
A C T G
Define W ij to be the indicator random variable for the event of a match
between position i of the first sequence and position j of the second se-
quence. The indicator random variable X ij of the event that a perfect
match of length t or longer begins at positions i and j of the two sequences
is given by
t−1
X ij =(1 − W i−1,j−1 ) W i+k,j+k .
k=0
This expression for X ij ignores end effects. For m and n large compared
to t, end effects will be trivial. Alternatively, imagine the two sequences
wrapped into circles, and interpret the subscript arithmetic involved in
defining X ij as modulo m and n.
According to the Chen-Stein approximation, the distribution function of
the longest match M mn satisfies
Pr(M mn <t)=Pr( X ij =0)
(i,j)
≈ e −λ ,
where λ = Pr(X ij = 1). There are mn pairs (i, j) and each has
(i,j)
t
probability Pr(X ij =1)= (1 − q)q of initiating a perfect match. It follows
t
that λ = mn(1 − q)q . Evaluating the error bound (14.3) for the Poisson
approximation is possible, but too complicated to present here. See [20] for
the full treatment.