Page 308 - Applied Probability
P. 308

13. Sequence Analysis
                              296
                                   length of the longest run of A’s in the word. Show that the distribution
                                       A
                                   of R satisfies the recurrence relation
                                       n
                                                            m+1

                                                A
                                           Pr(R ≤ m)=
                                                                A
                                                                               n−i
                                                n
                                                            i=1  p i−1 (1 − p A )Pr(R A  ≤ m)
                                                                        A
                                   for n> m and the initial condition Pr(R ≤ m) = 1 for n ≤ m [9].
                                                                        n
                                   Similar results obtain for runs of any other letter.
                                                                      A
                                10. In the context of the last problem, let S be the length of the longest
                                                                      n
                                                                                             C
                                                                                         T
                                   run of any letter when the word starts with an A. Define S , S ,
                                                                                         n
                                                                                             n
                                   and S G  similarly. Argue that
                                         n
                                                         m
                                                        	   i−1       T
                                            A
                                        Pr(S ≤ m)=         p    p T Pr(S  ≤ m)           (13.13)
                                            n               A         n−i
                                                        i=1
                                                                 C
                                                                                  G
                                                        + p C Pr(S n−i  ≤ m)+ p G Pr(S n−i  ≤ m)
                                                         A
                                   for n> m and that Pr(S ≤ m)=1 for n ≤ m. Show how recurrence
                                                         n
                                                                                     T
                                                                                         C
                                   (13.13) and similar recurrences for the distributions of S , S , and
                                                                                     n
                                                                                         n
                                   S n G  permit exact calculation of the distribution of the length of the
                                   maximal run R n of any letter type.
                              13.9    References
                               [1] Aldous D (1989) Probability Approximations via the Poisson Clumping
                                   Heuristic. Springer-Verlag, New York
                               [2] Durbin R, Eddy S, Krogh A, Mitchison G (1998) Biological Sequence
                                   Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cam-
                                   bridge University Press, Cambridge
                               [3] Feller W (1969) An Introduction to Probability Theory and its Appli-
                                   cations, Vol 1, 2nd ed. Wiley, New York
                               [4] Grimmett GR, Stirzaker DR (1992) Probability and Random Processes,
                                   2nd ed. Oxford University Press, Oxford
                               [5] Kruskal JB (1983) An overview of sequence comparison: Time warps,
                                   string edits, and macromolecules. SIAM Review 25:201–237
                               [6] Needleman SB, Wunsch CD (1970) A general method applicable to
                                   the search for similarities in the amino acid sequences of two proteins.
                                   J Mol Biol 48:443–453
                               [7] Smith TF, Waterman MS (1981) The identification of common mole-
                                   cular subsequences. J Mol Biol 147:195–197
   303   304   305   306   307   308   309   310   311   312   313