Page 298 - Applied Probability
P. 298
13. Sequence Analysis
286
The first two deletions exact penalty δ each, the next two substitutions ex-
act penalties d(D, R) and d(U, E), and the last two insertions exact penalty
δ each. This tallies with the distance 4δ + d(D, R)+ d(U, E) attributed to
the alignment
IN - - DU STRY
INTE R E ST - -
In general, each listing provides a path from x to y corresponding to a
unique alignment. The length of the path is the sum the penalties exacted
along the way; this length perfectly matches the distance of the correspond-
ing alignment if no repetitious operations such as the insertion and deletion
of the same blank are carried out. The distance D(x, y) is the length of
the minimal path from x to y. The relevance of these observations to prov-
ing the triangle inequality lies in appending a minimal path from y to z
to a minimal path from x to y. The length of the concatenated path is
D(x, y)+ D(y, z). Being one of many paths connecting x and z, the con-
catenated path has length bounded below by D(x, z). Thus, the triangle
property is true.
One of the difficulties with the Levenshtein metric is that it is implic-
itly given as the solution to a minimization problem over the large space
of alignments. Even though we cannot give an explicit formula for the
metric, we can calculate its value with reasonable efficiency via a dy-
namic programming algorithm introduced by Needleman and Wunsch [6].
If x =(x 1 ,...,x m ) and y =(y 1 ,...,y n ) are two strings, then the dynamic
programming algorithm operates on the matrix of distances with entries
D ij = D[(x 1 ,...,x i ), (y 1 ,...,y j )]. A little reflection renders the boundary
values
D i0 = iδ, D 0j = jδ, i, j ≥ 0
obvious. These boundary values are fed into the recurrence
=min{D i−1,j + δ, D i,j−1 + δ, D i−1,j−1 + d(x i ,y j )}, (13.5)
D ij
whose proof follows the pattern of the proof of the recurrence (13.3). For
instance, if the optimal alignment between (x 1 ,... ,x i ) and (y 1 ,...,y j ) ends
with the pair x i , then we should take an optimal alignment up to this pair
−
with cost D i−1,j and add the penalty d(x i , −). Taking the minimum over
the three possible final pairs yields the minimum distance of any alignment
between the partial strings (x 1 ,... ,x i ) and (y 1 ,... ,y j ).
The beauty of the dynamic programming algorithm is that it not only
provides the distance D(x, y) when we reach (i, j)=(m, n), but it does so
efficiently and yields an optimal alignment in the process. The number of
operations is of order O(mn) because the recurrence (13.5) accesses only
entries below and to the left of the current entry and consequently can be