Page 293 - Applied Probability
P. 293
13
Sequence Analysis
13.1 Introduction
One of the first chores after a gene is mapped and cloned is to search
for similarities between the gene and previously cloned genes. These ho-
mologies can illuminate the evolutionary history of the new gene and the
structure and function of its derived protein. Indeed, it is fair to say that
sequence comparison is the single most useful application of the burgeoning
genetic databases. Cross species comparisons are being done on a massive
scale to identify gene families, regulatory motifs, and conserved re-
gions within genes. In this chapter, we explore some of the principles and
algorithms applied in recognizing DNA sequence patterns and producing
optimal alignments between two sequences. Our treatments of both prob-
lems are necessarily superficial.
In pattern recognition, we limit our comments to testing for the over-
abundance or underabundance of known patterns using simple Poisson
models. Pattern discovery is a more diffuse and difficult topic. In sequence
alignment we look only at dynamic programming algorithms. Com-
peting approaches tend to be either more heuristic or more convoluted
[2, 10, 11, 12]. Dynamic programming solves a problem by taking advantage
of already computed solutions of smaller examples of the same problem.
The general technique is well worth learning. Even the current specialized
algorithms for sequence alignment apply not only to molecular genetics but
also to other diverse problem areas such as text collation, error detection
and correction of coded messages, and matching of geological strata [5].
13.2 Pattern Matching
Occasionally, one is interested in counting the number of occurrences of
a particular pattern in a DNA sequence and in determining whether the
pattern is over, or under, represented. For instance, one might suspect
that the AluI restriction site AGCT occurs too seldomly in a given DNA
sequence x =(x 1 ,...,x n ). For the sake of simplicity, assume that each
x i of x is drawn independently from the set of possible bases A, C, G,
and T with probabilities p A , p C , p G, and p T , respectively. For any i be-
tween 4 and n, the probability that (x i−3 ,x i−2 ,x i−1 ,x i ) = (A,G,C,T) is
p A p Gp C p T . Furthermore, the number of bases between neighboring oc-
currences of this pattern is nearly geometrically distributed with success