Page 297 - Applied Probability
P. 297

13. Sequence Analysis
                                                                                            285
                              simpler recurrence
                                              g(m, n)= g(m − 1,n)+ g(m, n − 1)
                                                                                          (13.4)
                              based on Pascal’s combinatorial triangle. We now argue inductively that
                              f(m, n) ≥ g(m, n). The induction is primed by the identical initial con-
                              dition (13.2). Assuming the inequality holds for all pairs of nonnegative
                              integers (m, n)with m + n = k − 1 > 0 for some integer k − 1, we advance
                              the induction from k − 1to k by neglecting the third term f(m − 1,n − 1)
                              on the right of recurrence (13.3), substituting g(m − 1,n) for f(m − 1,n)
                              and g(m, n − 1) for f(m, n − 1), and finally invoking recurrence (13.4).
                              This proves the inequality for all pairs (m, n)with m + n = k. The size
                              of g(n, n) can be accurately estimated by application of Stirling’s formula
                                   √      1
                              n! ∼   2πn n+ 2 e −n . A straightforward calculation shows that
                                                               2 2n
                                                      g(n, n) ∼ √  .
                                                                πn
                              13.4 Minimum Distance Alignment

                              To an alignment x and y of x and y, we assign the distance     d(x ,y ).
                                                                                          ∗
                                                                                             ∗
                                                     ∗
                                              ∗
                                                                                      i   i  i
                              Although this is an infinite sum, all but a finite number of terms are 0 owing
                              to the fact d(−, −) = 0. The Levenshtein distance between the original
                              strings x and y we now define as

                                                                        ∗
                                                D(x, y)  =   min     d(x ,y ),
                                                                           ∗
                                                                        i
                                                                           i
                                                            (x ,y )
                                                              ∗
                                                                ∗
                                                                   i
                                                 ∗
                                              ∗
                              where the pair (x , y ) ranges over all possible alignments. The symmetry
                              property (a) characterizing a metric follows directly from the definition of
                              D(x, y). The inequality D(x, y) ≥ 0 is also obvious. Equality holds when
                              x = y because d(x, y) is a metric. Strict inequality holds when x  = y
                                                                                    ∗
                                                                                       ∗
                              because there are only a finite number of alignment pairs (x , y ). Thus,
                              property (b) of a metric is also true.
                                The validity of the triangle inequality (c) is less obvious. One way of

                                                                                           ∗
                                                                                        ∗
                              proving it is to think of computing the alignment distance  d(x ,y )by
                                                                                    i   i  i
                              transforming x into y by a sequence of steps, each step carrying its own
                              penalty. Consider the example of converting the word INDUSTRY into the
                              word INTEREST [5] in the following listing
                                                IND       U   S  TRY
                                                IND       U   S  TR     -
                                                IND       U   S  T   -  -
                                                IN    R   U   S  T   -  -
                                                IN    R   E   S  T   -  -
                                                IN    T   R  E   S  T   -
                                                IN    T   E  R   E   S  T
   292   293   294   295   296   297   298   299   300   301   302