Page 298 - Applied Probability
P. 298

13. Sequence Analysis
                              286
                              The first two deletions exact penalty δ each, the next two substitutions ex-
                              act penalties d(D, R) and d(U, E), and the last two insertions exact penalty
                              δ each. This tallies with the distance 4δ + d(D, R)+ d(U, E) attributed to
                              the alignment
                                             IN     -  -  DU     STRY
                                             INTE         R   E  ST     -  -
                                In general, each listing provides a path from x to y corresponding to a
                              unique alignment. The length of the path is the sum the penalties exacted
                              along the way; this length perfectly matches the distance of the correspond-
                              ing alignment if no repetitious operations such as the insertion and deletion
                              of the same blank are carried out. The distance D(x, y) is the length of
                              the minimal path from x to y. The relevance of these observations to prov-
                              ing the triangle inequality lies in appending a minimal path from y to z
                              to a minimal path from x to y. The length of the concatenated path is
                              D(x, y)+ D(y, z). Being one of many paths connecting x and z, the con-
                              catenated path has length bounded below by D(x, z). Thus, the triangle
                              property is true.
                                One of the difficulties with the Levenshtein metric is that it is implic-
                              itly given as the solution to a minimization problem over the large space
                              of alignments. Even though we cannot give an explicit formula for the
                              metric, we can calculate its value with reasonable efficiency via a dy-
                              namic programming algorithm introduced by Needleman and Wunsch [6].
                              If x =(x 1 ,...,x m ) and y =(y 1 ,...,y n ) are two strings, then the dynamic
                              programming algorithm operates on the matrix of distances with entries
                              D ij = D[(x 1 ,...,x i ), (y 1 ,...,y j )]. A little reflection renders the boundary
                              values

                                           D i0 = iδ,    D 0j = jδ,   i, j ≥ 0
                              obvious. These boundary values are fed into the recurrence

                                          =min{D i−1,j + δ, D i,j−1 + δ, D i−1,j−1 + d(x i ,y j )}, (13.5)
                                     D ij
                              whose proof follows the pattern of the proof of the recurrence (13.3). For
                              instance, if the optimal alignment between (x 1 ,... ,x i ) and (y 1 ,...,y j ) ends

                              with the pair  x i  , then we should take an optimal alignment up to this pair
                                           −
                              with cost D i−1,j and add the penalty d(x i , −). Taking the minimum over
                              the three possible final pairs yields the minimum distance of any alignment
                              between the partial strings (x 1 ,... ,x i ) and (y 1 ,... ,y j ).
                                The beauty of the dynamic programming algorithm is that it not only
                              provides the distance D(x, y) when we reach (i, j)=(m, n), but it does so
                              efficiently and yields an optimal alignment in the process. The number of
                              operations is of order O(mn) because the recurrence (13.5) accesses only
                              entries below and to the left of the current entry and consequently can be
   293   294   295   296   297   298   299   300   301   302   303