Page 299 - Applied Probability
P. 299
287
13. Sequence Analysis
run entirely within the rectangle {(i, j) ∈ [0,m] × [0,n]}. To construct an
optimal alignment, all we have to do is record at each application of the
recurrence (13.5) which of the three terms gives the minimum. In case of a
tie, we arbitrarily choose a best term. If we keep track of this information
in a system of pointers, then when we reach entry (m, n) in the matrix of
distances, we can follow the pointers back to (0, 0). At each step we recover
one more pair of an optimal alignment, working from its right end to its
left end.
If more than one best alignment exists, and we want to identify all of
them, then we need to keep track of a stack of equivalent pointers at each
entry of the D matrix. Only in the case of a tie will a stack contain more
than a single pointer. In finding the first optimal alignment, we start at
(m, n) and follow the path defined by the top pointer of each successively
encountered stack. Once we reach (0, 0) or a boundary and identify this
alignment, we reverse direction along the pointers previously traversed.
Each time we backup and hit a stack, we move down one pointer in the
stack, and start a new forward path commencing with that pointer. When
we reach (0, 0) or a boundary, we identify a new optimal alignment. If at
any stage in a backtrack we hit the bottom of a stack and cannot descend
farther, then we backtrack along the bottom pointer to the next entry of the
D matrix. The backtracking process ends when the final backtrack hits the
bottom of the stack at (m, n). Because it produces successive alignments
that share as much of their right ends as possible, backtracking is very
efficient.
An interesting variant of the dynamic programming algorithm is specifi-
cally designed to handle the simultaneous insertion or deletion of multiple
bases. This occurs often enough in the evolution of DNA sequences to war-
rant comment. Suppose that we penalize an indel of length k by g(k). For
example, we could take g(k)= α +(k − 1)β; the case α = β = δ obvi-
ously coincides with our previous penalty on single indels. In general, it
is desirable to impose the subadditivity condition g(k) ≤ kg(1) on g(k)
because the insertion or deletion of a block of bases is more likely than the
sequential insertion or deletion of each base of the block.
To compute the distance D(x, y) between two strings x =(x 1 ,...,x m )
and y =(y 1 ,... ,y n ), it is necessary to define three quantities E ij , F ij ,
and G ij analogous to D ij . These are the minimum distances between two
alignments of the partial strings (x 1 ,... ,x i ) and (y 1 ,...,y j ) ending in x i ,
−
− x i
, and , respectively. Clearly, we have D ij =min{E ij ,F ij ,G ij }. The
y j y j
boundary conditions for these matrices are
E 00 =0, E i0 = g(i), E 0j = ∞,
F 00 =0, F i0 = ∞, F 0j = g(j)
G 00 =0, G i0 = ∞, G 0j = ∞
for i and j positive. The infinities appearing among these boundary condi-