Page 294 - Applied Probability
13. Sequence Analysis
probability $p_A p_G p_C p_T$. If we go over to a continuous approximation, we can
envision AluI sites scattered randomly according to a Poisson process on
the interval $[0,n]$ with intensity $p_A p_G p_C p_T$. This model is motivated by the
observations that interarrival times in a Poisson process are exponentially
distributed and that the exponential distribution is the continuous analog
of the geometric distribution.
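
As a rough numerical check of this analogy, the sketch below compares the tail of a geometric waiting time with the tail of an exponential waiting time of the same rate. It assumes equal base frequencies of 1/4, an illustrative choice not made in the text, and the function names are hypothetical.

```python
import math

# Assumption for illustration only: all four bases are equally likely.
p_site = 0.25 ** 4  # p_A * p_G * p_C * p_T for the AluI site AGCT


def geometric_tail(p, k):
    """P(no success in the first k trials) for success probability p."""
    return (1.0 - p) ** k


def exponential_tail(rate, t):
    """P(T > t) for an exponential waiting time with the given rate."""
    return math.exp(-rate * t)


# Compare the two tails at a few distances (in bases).
for k in (100, 256, 1000):
    g = geometric_tail(p_site, k)
    e = exponential_tail(p_site, k)
    print(f"k={k:5d}  geometric={g:.4f}  exponential={e:.4f}")
```

When the site probability is small, the two tails agree closely, which is what makes the continuous approximation plausible for rarely cutting enzymes.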
For other restriction enzymes, the situation is more complicated. For
instance, consider the restriction enzyme HhaI with recognition site GCGC.
In this case, recognition sites tend to occur in clumps. Thus, if GCGC
occurs, it is easy to achieve a second recognition site by extending GCGC
to GCGCGC. In treating this and more general patterns, we had better
be specific in defining a clump. The most workable definition of a clump
involves renewal theory and departs slightly from standard English usage.
In a finite DNA sequence, if the first occurrence of the pattern ends at
position n, then we have the first renewal at position n. Subsequent renewals of
the pattern occur at subsequent nonoverlapping occurrences of the pattern.
For example, in the sequence TGCGCAGCGCGCGCGCA, renewals occur
at positions 5, 10, and 14. A clump is formed by a renewal of the pattern
and any overlapping realizations of the pattern to the right of the renewal.
Thus, the clump sizes for the three renewals just noted are 1, 2, and 2.
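
The definition can be made concrete with a short sketch that scans a sequence, records renewals by their 1-based end positions, and counts the overlapping occurrences attached to each renewal. The function name `renewals_and_clumps` is hypothetical and is introduced only for illustration.

```python
def renewals_and_clumps(seq, pattern):
    """Return (renewal_positions, clump_sizes) using 1-based end positions."""
    m = len(pattern)
    # 1-based end positions of every occurrence, overlapping ones included.
    ends = [j + m for j in range(len(seq) - m + 1) if seq[j:j + m] == pattern]

    renewals, sizes = [], []
    for end in ends:
        start = end - m + 1
        if not renewals or start > renewals[-1]:
            # No overlap with the current renewal: this is a new renewal.
            renewals.append(end)
            sizes.append(1)
        else:
            # Overlaps the current renewal: it joins that renewal's clump.
            sizes[-1] += 1
    return renewals, sizes


print(renewals_and_clumps("TGCGCAGCGCGCGCGCA", "GCGC"))
# -> ([5, 10, 14], [1, 2, 2])
```

The occurrence ending at position 14 overlaps the one ending at position 12 but not the renewal at position 10, so it starts a new clump rather than extending the previous one.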
Rather than treat this specific case further, let us consider a general
pattern $R = (r_1, \ldots, r_m)$ and investigate its expected clump size $c$. To
determine $c$, we set $R^{(i)} = (r_1, \ldots, r_i)$ and $R_{(i)} = (r_{m-i+1}, \ldots, r_m)$. The
equation we are looking for is
$$c = 1 + \sum_{i=1}^{m-1} 1_{\{R^{(i)} = R_{(i)}\}} \, p_{r_{i+1}} \cdots p_{r_m}. \tag{13.1}$$
The constant 1 on the right of this equation simply counts a renewal of
the pattern. The ith term in the sum involves the overlap at i sites of
the renewal with a second realization of the pattern to the right of the
renewal. To attain this second realization, the condition $R^{(i)} = R_{(i)}$ must
hold. The remaining $m - i$ bases of the second realization must also fill out
the pattern. This further condition holds with probability $p_{r_{i+1}} \cdots p_{r_m}$.
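
Equation (13.1) translates directly into a few lines of code. The following is a minimal sketch: the function name `expected_clump_size`, the dictionary `freq` of base frequencies, and the uniform frequencies used in the example are all assumptions made for illustration.

```python
from math import prod


def expected_clump_size(pattern, freq):
    """Expected clump size c from equation (13.1)."""
    m = len(pattern)
    c = 1.0  # counts the renewal itself
    for i in range(1, m):
        # An overlap at i sites needs the first i bases to equal the last i.
        if pattern[:i] == pattern[m - i:]:
            # The remaining bases r_{i+1}, ..., r_m must complete the pattern.
            c += prod(freq[b] for b in pattern[i:])
    return c


# Hypothetical base frequencies: all four bases equally likely.
uniform = {b: 0.25 for b in "ACGT"}

print(expected_clump_size("GCGC", uniform))  # 1.0625
print(expected_clump_size("AGCT", uniform))  # 1.0
```

For GCGC only the overlap at two sites contributes, giving $c = 1 + p_G p_C$. The AluI site AGCT has no self-overlap, so its clumps always have size 1.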
Once again we suppose that clumps occur according to a Poisson process
with intensity λ [1]. Naturally, this assumption improves for restriction
enzymes that cut less frequently. Ignoring the fact that R cannot start at
any of the last $m - 1$ sites, the expected number of occurrences of the
pattern $R$ is the product
$$nc\lambda = n p_{r_1} \cdots p_{r_m}.$$
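
As a worked instance, take HhaI with recognition site GCGC under the illustrative assumption of equal base frequencies $p_A = p_C = p_G = p_T = 1/4$ (a value chosen here, not taken from the text). Only the overlap at $i = 2$ sites satisfies the matching condition, so equation (13.1) gives $c = 1 + p_G p_C$, and the product relation yields

$$\lambda = \frac{p_G^2 p_C^2}{1 + p_G p_C} = \frac{1/256}{17/16} = \frac{1}{272}.$$

Under this assumption the clump intensity falls slightly below the per-site probability $1/256$, because some occurrences of GCGC are absorbed into existing clumps rather than starting new ones.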
Solving for the mean distance $\lambda^{-1}$ between renewals yields
$$\lambda^{-1} = \frac{c}{p_{r_1} \cdots p_{r_m}} = \sum_{i=1}^{m} 1_{\{R^{(i)} = R_{(i)}\}} \frac{1}{p_{r_1} \cdots p_{r_i}},$$