Page 299 - Applied Probability
P. 299

287
                                                                      13. Sequence Analysis
                              run entirely within the rectangle {(i, j) ∈ [0,m] × [0,n]}. To construct an
                              optimal alignment, all we have to do is record at each application of the
                              recurrence (13.5) which of the three terms gives the minimum. In case of a
                              tie, we arbitrarily choose a best term. If we keep track of this information
                              in a system of pointers, then when we reach entry (m, n) in the matrix of
                              distances, we can follow the pointers back to (0, 0). At each step we recover
                              one more pair of an optimal alignment, working from its right end to its
                              left end.
                                If more than one best alignment exists, and we want to identify all of
                              them, then we need to keep track of a stack of equivalent pointers at each
                              entry of the D matrix. Only in the case of a tie will a stack contain more
                              than a single pointer. In finding the first optimal alignment, we start at
                              (m, n) and follow the path defined by the top pointer of each successively
                              encountered stack. Once we reach (0, 0) or a boundary and identify this
                              alignment, we reverse direction along the pointers previously traversed.
                              Each time we backup and hit a stack, we move down one pointer in the
                              stack, and start a new forward path commencing with that pointer. When
                              we reach (0, 0) or a boundary, we identify a new optimal alignment. If at
                              any stage in a backtrack we hit the bottom of a stack and cannot descend
                              farther, then we backtrack along the bottom pointer to the next entry of the
                              D matrix. The backtracking process ends when the final backtrack hits the
                              bottom of the stack at (m, n). Because it produces successive alignments
                              that share as much of their right ends as possible, backtracking is very
                              efficient.
                                An interesting variant of the dynamic programming algorithm is specifi-
                              cally designed to handle the simultaneous insertion or deletion of multiple
                              bases. This occurs often enough in the evolution of DNA sequences to war-
                              rant comment. Suppose that we penalize an indel of length k by g(k). For
                              example, we could take g(k)= α +(k − 1)β; the case α = β = δ obvi-
                              ously coincides with our previous penalty on single indels. In general, it
                              is desirable to impose the subadditivity condition g(k) ≤ kg(1) on g(k)
                              because the insertion or deletion of a block of bases is more likely than the
                              sequential insertion or deletion of each base of the block.
                                To compute the distance D(x, y) between two strings x =(x 1 ,...,x m )
                              and y =(y 1 ,... ,y n ), it is necessary to define three quantities E ij , F ij ,
                              and G ij analogous to D ij . These are the minimum distances between two

                              alignments of the partial strings (x 1 ,... ,x i ) and (y 1 ,...,y j ) ending in  x i  ,
                                                                                            −

                                −       x i
                                  , and   , respectively. Clearly, we have D ij =min{E ij ,F ij ,G ij }. The
                               y j      y j
                              boundary conditions for these matrices are
                                           E 00 =0,    E i0  = g(i),   E 0j = ∞,
                                           F 00 =0,     F i0  = ∞,     F 0j = g(j)
                                           G 00 =0,    G i0  = ∞,      G 0j = ∞
                              for i and j positive. The infinities appearing among these boundary condi-
   294   295   296   297   298   299   300   301   302   303   304