Page 308 - Applied Probability
P. 308
13. Sequence Analysis
296
length of the longest run of A’s in the word. Show that the distribution
A
of R satisfies the recurrence relation
n
m+1
A
Pr(R ≤ m)=
A
n−i
n
i=1 p i−1 (1 − p A )Pr(R A ≤ m)
A
for n> m and the initial condition Pr(R ≤ m) = 1 for n ≤ m [9].
n
Similar results obtain for runs of any other letter.
A
10. In the context of the last problem, let S be the length of the longest
n
C
T
run of any letter when the word starts with an A. Define S , S ,
n
n
and S G similarly. Argue that
n
m
i−1 T
A
Pr(S ≤ m)= p p T Pr(S ≤ m) (13.13)
n A n−i
i=1
C
G
+ p C Pr(S n−i ≤ m)+ p G Pr(S n−i ≤ m)
A
for n> m and that Pr(S ≤ m)=1 for n ≤ m. Show how recurrence
n
T
C
(13.13) and similar recurrences for the distributions of S , S , and
n
n
S n G permit exact calculation of the distribution of the length of the
maximal run R n of any letter type.
13.9 References
[1] Aldous D (1989) Probability Approximations via the Poisson Clumping
Heuristic. Springer-Verlag, New York
[2] Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological Sequence
Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cam-
bridge University Press, Cambridge
[3] Feller W (1969) An Introduction to Probability Theory and its Appli-
cations, Vol 1, 2nd ed. Wiley, New York
[4] Grimmett GR, Stirzaker DR (1992) Probability and Random Processes,
2nd ed. Oxford University Press, Oxford
[5] Kruskal JB (1983) An overview of sequence comparison: Time warps,
string edits, and macromolecules. SIAM Review 25:201–237
[6] Needleman SB, Wunsch CD (1970) A general method applicable to
the search for similarities in the amino acid sequences of two proteins.
J Mol Biol 48:443–453
[7] Smith TF, Waterman MS (1981) The identification of common mole-
cular subsequences. J Mol Biol 147:195–197