Page 294 - Applied Probability

13. Sequence Analysis
probability $p_A p_G p_C p_T$. If we go over to a continuous approximation, we can envision AluI sites scattered randomly according to a Poisson process on the interval $[0, n]$ with intensity $p_A p_G p_C p_T$. This model is motivated by the observations that interarrival times in a Poisson process are exponentially distributed and that the exponential distribution is the continuous analog of the geometric distribution.
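The closeness of the geometric and exponential models can be checked numerically. The following is a minimal sketch, assuming independent bases with equal frequencies of 1/4 (an illustrative choice, not a value taken from the text) and the AluI site AGCT; the function names are hypothetical.

```python
import math

# Assumption for illustration only: all four bases are equally likely.
base_probs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def site_probability(site, probs):
    """Probability that a fixed position starts the site, assuming independent bases."""
    return math.prod(probs[b] for b in site)

def geometric_tail(k, p):
    """P(waiting time > k) for a geometric waiting time with success probability p."""
    return (1.0 - p) ** k

def exponential_tail(t, rate):
    """P(waiting time > t) for an exponential waiting time with the given rate."""
    return math.exp(-rate * t)

p = site_probability("AGCT", base_probs)  # p_A p_G p_C p_T = 1/256 here

# Compare the two tail probabilities at a few distances (in bases).
for k in (100, 256, 1000):
    print(k, geometric_tail(k, p), exponential_tail(k, p))
```

With a success probability this small, the two tails should agree closely at every distance printed, which is the sense in which the Poisson process approximates the discrete model.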
For other restriction enzymes, the situation is more complicated. For instance, consider the restriction enzyme HhaI with recognition site GCGC. In this case, recognition sites tend to occur in clumps. Thus, if GCGC occurs, it is easy to achieve a second recognition site by extending GCGC to GCGCGC. In treating this and more general patterns, we had better be specific in defining a clump. The most workable definition of a clump involves renewal theory and departs slightly from standard English usage. In a finite DNA sequence, if the first occurrence of the pattern ends at position n, then we have the first renewal at position n. Subsequent renewals of the pattern occur at subsequent nonoverlapping occurrences of the pattern. For example, in the sequence TGCGCAGCGCGCGCGCA, renewals occur at positions 5, 10, and 14. A clump is formed by a renewal of the pattern and any overlapping realizations of the pattern to the right of the renewal. Thus, the clump sizes for the three renewals just noted are 1, 2, and 2.
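The renewal and clump definitions can be made concrete with a short scan over a sequence. This is a minimal sketch of one way to apply the definitions; the function names are hypothetical, and positions are 1-based to match the text.

```python
def occurrence_ends(seq, pattern):
    """1-based end positions of every occurrence of pattern, overlapping ones included."""
    m = len(pattern)
    return [i + m for i in range(len(seq) - m + 1) if seq[i:i + m] == pattern]

def renewals_and_clumps(seq, pattern):
    """Return (renewal end positions, clump size for each renewal)."""
    m = len(pattern)
    renewals, clump_sizes = [], []
    for end in occurrence_ends(seq, pattern):
        start = end - m + 1
        if not renewals or start > renewals[-1]:
            # This occurrence does not overlap the latest renewal: it is a new renewal.
            renewals.append(end)
            clump_sizes.append(1)
        else:
            # This occurrence overlaps the latest renewal: it joins that clump.
            clump_sizes[-1] += 1
    return renewals, clump_sizes

renewals, sizes = renewals_and_clumps("TGCGCAGCGCGCGCGCA", "GCGC")
print(renewals, sizes)  # [5, 10, 14] [1, 2, 2]
```

The scan reproduces the renewals at positions 5, 10, and 14 and the clump sizes 1, 2, and 2 quoted for the example sequence.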
Rather than treat this specific case further, let us consider a general pattern $R = (r_1, \ldots, r_m)$ and investigate its expected clump size $c$. To determine $c$, we set $R^{(i)} = (r_1, \ldots, r_i)$ and $R_{(i)} = (r_{m-i+1}, \ldots, r_m)$. The equation we are looking for is

$$c = 1 + \sum_{i=1}^{m-1} p_{r_{i+1}} \cdots p_{r_m} 1_{\{R^{(i)} = R_{(i)}\}}. \tag{13.1}$$
The constant 1 on the right of this equation simply counts a renewal of the pattern. The $i$th term in the sum involves the overlap at $i$ sites of the renewal with a second realization of the pattern to the right of the renewal. To attain this second realization, the condition $R^{(i)} = R_{(i)}$ must hold. The remaining $m - i$ bases of the second realization must also fill out the pattern. This further condition holds with probability $p_{r_{i+1}} \cdots p_{r_m}$.
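Equation (13.1) translates directly into a loop over the possible overlap lengths. The sketch below is one way to evaluate it, assuming independent bases; the uniform base frequencies and the function name are illustrative assumptions, not taken from the text.

```python
import math

def expected_clump_size(pattern, probs):
    """Evaluate equation (13.1) for a pattern under independent base probabilities."""
    m = len(pattern)
    c = 1.0  # the renewal itself
    for i in range(1, m):
        # Overlap of length i requires the first i bases to equal the last i bases.
        if pattern[:i] == pattern[m - i:]:
            # The second realization must still supply bases r_{i+1}, ..., r_m.
            c += math.prod(probs[b] for b in pattern[i:])
    return c

# Assumption for illustration only: equal base frequencies.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(expected_clump_size("GCGC", uniform))  # 1.0625
print(expected_clump_size("AGCT", uniform))  # 1.0
```

For GCGC only the overlap of length 2 contributes, adding $p_G p_C = 1/16$ under the assumed frequencies, while AGCT has no self-overlap and so has expected clump size 1.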
                                Once again we suppose that clumps occur according to a Poisson process
                              with intensity λ [1]. Naturally, this assumption improves for restriction
                              enzymes that cut less frequently. Ignoring the fact that R cannot start at
                              any of the last m − 1 sites, the expected number of occurrences of the
                              pattern R is the product
$$n c \lambda = n p_{r_1} \cdots p_{r_m}.$$
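The product relation can be rearranged numerically to obtain the clump intensity and the mean spacing between renewals. The following is a minimal sketch under the same independence assumption; the uniform base frequencies, the sequence length, and the function names are illustrative assumptions.

```python
import math

def expected_clump_size(pattern, probs):
    """Expected clump size c from equation (13.1)."""
    m = len(pattern)
    return 1.0 + sum(
        math.prod(probs[b] for b in pattern[i:])
        for i in range(1, m)
        if pattern[:i] == pattern[m - i:]
    )

def clump_intensity(pattern, probs):
    """Solve c * lambda = p_{r_1} ... p_{r_m} for lambda."""
    pattern_prob = math.prod(probs[b] for b in pattern)
    return pattern_prob / expected_clump_size(pattern, probs)

# Assumptions for illustration only: equal base frequencies, n = 10,000 bases.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
n = 10_000

lam = clump_intensity("GCGC", uniform)
print(1.0 / lam)   # mean distance between renewals: 272.0
print(n * lam)     # approximate expected number of clumps in n bases
```

Under these assumed frequencies, clumping stretches the mean spacing for GCGC from 256 bases, the value for a pattern without self-overlap, to 272 bases.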
Solving for the mean distance $\lambda^{-1}$ between renewals yields

$$\lambda^{-1} = \frac{c}{p_{r_1} \cdots p_{r_m}} = \sum_{i=1}^{m} \frac{1}{p_{r_1} \cdots p_{r_i}} 1_{\{R^{(i)} = R_{(i)}\}},$$