Page 301 - Applied Probability
P. 301

289
                                                                      13. Sequence Analysis
                              13.5 Parallel Processing and Memory Reduction
                              Computational complexity is not the only criterion for evaluating an algo-
                              rithm. Demands on memory and the potential for parallel processing are
                              also important. Proper implementation of the Needleman-Wunsch algo-
                              rithm leads to substantial improvements in both regards. Parallel process-
                              ing cannot be achieved by proceeding row by row or column by column in
                              the dynamic programming matrix, but it can be achieved by proceeding
                              along parallel diagonal lines. With a proportional gap penalty g(k)= kδ,
                              suppose we know the entries D i,j for all combinations (i, j)with i+j ≤ m.
                              Then each entry D i,j along the diagonal i + j = m + 1 is determined
                              independently via equation (13.5) from three entries in the two previous
                              diagonal lines i + j = m and i + j = m − 1. A multiprocessor machine with
                              shared memory could easily exploit this fact.
                                Memory reduction is more subtle. As presented, the Needleman-Wunsch
                              algorithm requires storing a pointer at each of mn cells. If it were not for
                              this storage requirement, then we could proceed row by row (or column by
                              column), discarding rows (or columns) as we go. Discarding already visited
                              rows or columns reduces the memory requirement to min{m, n}. It is possi-
                              ble to tweak the Needleman-Wunsch algorithm in a manner that retains this
                              storage advantage, preserves the traceback property, and merely doubles
                              the overall computational complexity. For the sake of convenience, assume
                                                                     m
                              a proportional gap penalty and choose i =    . Any optimal alignment of
                                                                     2
                              x against y matches the letter x i with either a letter y j or a −. In the former
                              case, we perform the Needleman-Wunsch algorithm without traceback sep-
                              arately on the left substring (x 1 ,...,x i−1 ) and the full string y, and on the
                              right substring (x i+1 ,...,x m ) and the full string y. For the left substring of
                              (x 1 ,... ,x i−1 ), let the optimal alignment ending with the pair (i − 1,j − 1)
                              have distance L i−1,j−1 , and for the right substring (x i+1 ,...,x m ), let the
                              optimal alignment beginning with the pair (i +1,j + 1) have distance
                              R(i +1,j + 1). (It is trivial to modify the Needleman-Wunsch algorithm to
                              give optimal subalignments beginning with index pairs rather than ending
                              with index pairs.) If we match x i against −, then the pertinent left distance
                              for each j is L i−1,j rather than L i−1,j−1 . The best index j to match with
                              i solves the problem

                                min{L i−1,j−1 + d(x i ,y j )+ R i+1,j+1 ,L i−1,j + d(x i , −)+ R i+1,j+1 }.(13.8)
                                 j
                              Once we identify j, we can fill out the rest of the optimal match by recur-
                              sively applying our modified algorithm to the pair of left x and y substrings
                              and to the pair of right x and y substrings.
                                Although it is clear that the maximum storage required by this scheme is
                              m+n, we must make certain the computational complexity does not balloon
                              out of control. One thing we can dismiss right off is the depicted additions
                              and minima involved in finding the solution to (13.8). The amount of work
   296   297   298   299   300   301   302   303   304   305   306