Page 296 - Applied Probability
P. 296

13. Sequence Analysis
                              284
                              two alignments
                                                                          -
                                              GEN
                                                                       S
                                                                   P
                                              GENOT Y
                                                                       CS
                                                                    I
                                                         E
                                              GEN        E  T   I -  C  E S
                                                            T
                                              GENOT Y              P   E  S
                              The first alignment makes slightly more sense linguistically, but the second
                              alignment displays more aligned letter matches. Note that both alignments
                              involve the blank character -. In molecular applications, blank characters
                              are called insertions or deletions, or simply indels if there is no need to
                              distinguish between them. It is very convenient to add the blank character
                              to any alphabet. If there is metric d(x, y) defined on the alphabet, we will
                              extend the metric so that d(−,x)= d(x, −)= δ> 0 for some constant δ.
                              It is also convenient to extend every finite string x =(x 1 ,...,x m )tobean
                              infinite string by appending an infinite number of blank characters to its
                              right end.
                                Aligning strings involves the insertion of a finite number of leading or in-
                              ternal blanks rather than an infinite number of trailing blanks. However, we
                              are not allowed to insert leading and internal blanks willy-nilly in aligning

                              two strings; overlapping inserted blanks  −  are forbidden. This restriction
                                                                  −
                              implies that there are only a finite number of possible alignments of two
                              strings x =(x 1 ,... ,x m ) and y =(y 1 ,...,y n)with m and n letters, re-
                              spectively. Indeed, if the derived strings x and y provide an alignment of
                                                                   ∗
                                                                         ∗
                              x and y, then the nontrivial part of the alignment extends from position
                              1 to at most position m + n. These extreme alignments are achieved by
                              appending n leading blanks to x to create x and taking y = y or by
                                                                                    ∗
                                                                       ∗
                                                                      ∗
                              appending m leading blanks to y to create y and taking x = x.
                                                                                  ∗
                                To calculate the number of alignments f(m, n) between two given strings
                              x =(x 1 ,...,x m ) and y =(y 1 ,... ,y n ), we derive a recurrence relation
                              whose concomitant initial condition is
                                                  f(0,j)= f(i, 0) = 1                     (13.2)
                              for all nonnegative integers i and j. The recurrence relation (13.3) below
                              stems from the simple observation that the nontrivial part of any alignment

                              ends with one of the three pairs  x m  ,  −  ,or  x m  . The remainder of the
                                                            −    y n    y n
                              alignment to the left of one of these pairs constitutes a legitimate alignment
                              between two shorter strings. Hence,
                                     f(m, n)= f(m − 1,n)+ f(m, n − 1) + f(m − 1,n − 1). (13.3)
                              This recurrence relation is difficult to solve, but we can easily show that it
                              grows extremely rapidly.
                                               m+n      m+n
                                Let g(m, n)=       =       . Then it is clear that the binomial coeffi-
                                               m       n
                              cient g(m, n) satisfies the same initial condition (13.2) as f(m, n) but the
   291   292   293   294   295   296   297   298   299   300   301