Page 298 - Applied Probability

P. 298

13. Sequence Analysis
286
The ﬁrst two deletions exact penalty δ each, the next two substitutions ex-
act penalties d(D, R) and d(U, E), and the last two insertions exact penalty
δ each. This tallies with the distance 4δ + d(D, R)+ d(U, E) attributed to
the alignment
IN - - DU STRY
INTE R E ST - -
In general, each listing provides a path from x to y corresponding to a
unique alignment. The length of the path is the sum the penalties exacted
along the way; this length perfectly matches the distance of the correspond-
ing alignment if no repetitious operations such as the insertion and deletion
of the same blank are carried out. The distance D(x, y) is the length of
the minimal path from x to y. The relevance of these observations to prov-
ing the triangle inequality lies in appending a minimal path from y to z
to a minimal path from x to y. The length of the concatenated path is
D(x, y)+ D(y, z). Being one of many paths connecting x and z, the con-
catenated path has length bounded below by D(x, z). Thus, the triangle
property is true.
One of the diﬃculties with the Levenshtein metric is that it is implic-
itly given as the solution to a minimization problem over the large space
of alignments. Even though we cannot give an explicit formula for the
metric, we can calculate its value with reasonable eﬃciency via a dy-
namic programming algorithm introduced by Needleman and Wunsch [6].
If x =(x 1 ,...,x m ) and y =(y 1 ,...,y n ) are two strings, then the dynamic
programming algorithm operates on the matrix of distances with entries
D ij = D[(x 1 ,...,x i ), (y 1 ,...,y j )]. A little reﬂection renders the boundary
values

D i0 = iδ, D 0j = jδ, i, j ≥ 0
obvious. These boundary values are fed into the recurrence

=min{D i−1,j + δ, D i,j−1 + δ, D i−1,j−1 + d(x i ,y j )}, (13.5)
D ij
whose proof follows the pattern of the proof of the recurrence (13.3). For
instance, if the optimal alignment between (x 1 ,... ,x i ) and (y 1 ,...,y j ) ends

with the pair x i , then we should take an optimal alignment up to this pair
−
with cost D i−1,j and add the penalty d(x i , −). Taking the minimum over
the three possible ﬁnal pairs yields the minimum distance of any alignment
between the partial strings (x 1 ,... ,x i ) and (y 1 ,... ,y j ).
The beauty of the dynamic programming algorithm is that it not only
provides the distance D(x, y) when we reach (i, j)=(m, n), but it does so
eﬃciently and yields an optimal alignment in the process. The number of
operations is of order O(mn) because the recurrence (13.5) accesses only
entries below and to the left of the current entry and consequently can be

293 294 295 296 297 298 299 300 301 302 303