Page 320 - Applied Probability
P. 320

309
                                                                  14. Poisson Approximation
                              indicates a common evolutionary origin of the two sequences. In many cases
                              homology also indicates a common biochemical or structural function of the
                              genes encoded by the sequences.
                                To pose the problem of sequence comparison statistically, consider two
                              sequences with m and n bases, respectively. Now imagine sliding the first
                              sequence along the second sequence. For some alignment the two sequences
                              share a region attaining the longest perfect match. (Here the match is local,
                              and no internal indels are permitted in either sequence.) Since ties can
                              occur, there may be several such regions. Let M mn be the random number
                              of base pairs involved in a longest perfect match. Under the null hypothesis
                              that the two sequences are unrelated, one can compute the approximate
                              distribution of M mn . If the observed value of M mn is inordinately large
                              according to this distribution, then significant homology can be claimed.
                                Computing the distribution of M mn is subtle. Fortunately, the Chen-
                              Stein method is applicable. Assume first that the bases appearing at the
                              various positions of either sequence are chosen independently from the set of
                              nucleotides {A,C,T,G} with probabilities q A , q C , q T , and q G , respectively.
                              The probability of a match between any two positions is
                                                           2
                                                               2
                                                                    2
                                                                         2
                                                   q  = q + q + q + q .
                                                           A   C    T   G
                              Define W ij to be the indicator random variable for the event of a match
                              between position i of the first sequence and position j of the second se-
                              quence. The indicator random variable X ij of the event that a perfect
                              match of length t or longer begins at positions i and j of the two sequences
                              is given by
                                                                    t−1

                                              X ij  =(1 − W i−1,j−1 )  W i+k,j+k .
                                                                    k=0
                              This expression for X ij ignores end effects. For m and n large compared
                              to t, end effects will be trivial. Alternatively, imagine the two sequences
                              wrapped into circles, and interpret the subscript arithmetic involved in
                              defining X ij as modulo m and n.
                                According to the Chen-Stein approximation, the distribution function of
                              the longest match M mn satisfies

                                               Pr(M mn <t)=Pr(         X ij =0)
                                                                   (i,j)
                                                             ≈ e −λ ,


                              where λ =        Pr(X ij = 1). There are mn pairs (i, j) and each has
                                           (i,j)
                                                            t
                              probability Pr(X ij =1)= (1 − q)q of initiating a perfect match. It follows
                                                 t
                              that λ = mn(1 − q)q . Evaluating the error bound (14.3) for the Poisson
                              approximation is possible, but too complicated to present here. See [20] for
                              the full treatment.
   315   316   317   318   319   320   321   322   323   324   325