Page 304 - Applied Probability
P. 304
13. Sequence Analysis
292
We first treat the case of a proportional gap score h(k)= −kδ. In de-
termining the best local alignment of two strings x =(x 1 ,...,x m ) and
y =(y 1 ,...,y n ), we define the matrix entry
= max{0,S[(x i ,... ,x k ), (y j ,... ,y l )] : 1 ≤ i ≤ k, 1 ≤ j ≤ l}.
H kl
In other words, H kl is the best similarity for an x substring ending with
x k against a y substring ending with y l . The presence of 0 in the defini-
tion of H kl indicates that a null string is a candidate substring in each case.
The Smith-Waterman algorithm computes H(x, y) = max kl H kl and a best
local alignment in the process. In computing the H kl by dynamic program-
ming, we have the boundary conditions H k0 = H 0l = 0 because aligning an
x (or y) substring against a null string gives a negative similarity score un-
less the x (or y) substring itself is null. Finally, the dynamic programming
algorithm feeds these boundary conditions into the recurrence
H kl = max{0,H k−1,l − δ, H k,l−1 − δ, H k−1,l−1 + s(x k ,y l )}. (13.10)
To find a best local alignment, we establish a system of pointers. Tracing
these pointers back from the best pair (k, l), we either reach a boundary or
a pointer points to the 0 term in (13.10). In either case, we terminate the
optimal substrings on the left.
We can extend the Smith-Waterman algorithm to an affine gap score
h(k)= −α − (k − 1)β by introducing the maximum substring similarities
E kl , F kl , and G kl for substring alignments ending in x k , − , and x k ,
− y l y l
respectively. In this notation H kl = max{E kl ,F kl ,G kl }, and we have the
boundary conditions E kl = F kl = G kl = 0 if either k =0 or l = 0. The
corresponding recurrences are
E kl = max{0,E k−1,l − β, F k−1,l − α, G k−1,l − α}
F kl = max{0,E k,l−1 − α, F k,l−1 − β, G k,l−1 − α}
G kl = max{0,E k−1,l−1 + s, F k−1,l−1 + s, G k−1,l−1 + s},
where in the last equation we have abbreviated s = s(x k ,y l ).
Finally, to identify repeat units within a DNA sequence x, it is helpful to
compare x to itself. Since x matches itself perfectly, the best local alignment
is the trivial global alignment. To avoid drawing this conclusion in the
Smith-Waterman algorithm, we define the new boundary condition H ii =0
for all i and invoke the recurrence (13.10) only on the upper triangle of (H ij )
where i< j.
13.8 Multiple Sequence Comparisons
Alignment of three or more sequences is considerably harder than alignment
of two sequences [8, 12]. To see how the computational complexity of global