Page 293 - Applied Probability
P. 293

13
                              Sequence Analysis
                              13.1 Introduction

                              One of the first chores after a gene is mapped and cloned is to search
                              for similarities between the gene and previously cloned genes. These ho-
                              mologies can illuminate the evolutionary history of the new gene and the
                              structure and function of its derived protein. Indeed, it is fair to say that
                              sequence comparison is the single most useful application of the burgeoning
                              genetic databases. Cross species comparisons are being done on a massive
                              scale to identify gene families, regulatory motifs, and conserved re-
                              gions within genes. In this chapter, we explore some of the principles and
                              algorithms applied in recognizing DNA sequence patterns and producing
                              optimal alignments between two sequences. Our treatments of both prob-
                              lems are necessarily superficial.
                                In pattern recognition, we limit our comments to testing for the over-
                              abundance or underabundance of known patterns using simple Poisson
                              models. Pattern discovery is a more diffuse and difficult topic. In sequence
                              alignment we look only at dynamic programming algorithms. Com-
                              peting approaches tend to be either more heuristic or more convoluted
                              [2, 10, 11, 12]. Dynamic programming solves a problem by taking advantage
                              of already computed solutions of smaller examples of the same problem.
                              The general technique is well worth learning. Even the current specialized
                              algorithms for sequence alignment apply not only to molecular genetics but
                              also to other diverse problem areas such as text collation, error detection
                              and correction of coded messages, and matching of geological strata [5].


                              13.2 Pattern Matching


                              Occasionally, one is interested in counting the number of occurrences of
                              a particular pattern in a DNA sequence and in determining whether the
                              pattern is over, or under, represented. For instance, one might suspect
                              that the AluI restriction site AGCT occurs too seldomly in a given DNA
                              sequence x =(x 1 ,...,x n ). For the sake of simplicity, assume that each
                              x i of x is drawn independently from the set of possible bases A, C, G,
                              and T with probabilities p A , p C , p G, and p T , respectively. For any i be-
                              tween 4 and n, the probability that (x i−3 ,x i−2 ,x i−1 ,x i ) = (A,G,C,T) is
                              p A p Gp C p T . Furthermore, the number of bases between neighboring oc-
                              currences of this pattern is nearly geometrically distributed with success
   288   289   290   291   292   293   294   295   296   297   298