Page 304 - Applied Probability
P. 304

13. Sequence Analysis
                              292
                                We first treat the case of a proportional gap score h(k)= −kδ. In de-
                              termining the best local alignment of two strings x =(x 1 ,...,x m ) and
                              y =(y 1 ,...,y n ), we define the matrix entry
                                       = max{0,S[(x i ,... ,x k ), (y j ,... ,y l )] : 1 ≤ i ≤ k, 1 ≤ j ≤ l}.
                                  H kl
                              In other words, H kl is the best similarity for an x substring ending with
                              x k against a y substring ending with y l . The presence of 0 in the defini-
                              tion of H kl indicates that a null string is a candidate substring in each case.
                              The Smith-Waterman algorithm computes H(x, y) = max kl H kl and a best
                              local alignment in the process. In computing the H kl by dynamic program-
                              ming, we have the boundary conditions H k0 = H 0l = 0 because aligning an
                              x (or y) substring against a null string gives a negative similarity score un-
                              less the x (or y) substring itself is null. Finally, the dynamic programming
                              algorithm feeds these boundary conditions into the recurrence
                                  H kl = max{0,H k−1,l − δ, H k,l−1 − δ, H k−1,l−1 + s(x k ,y l )}. (13.10)
                              To find a best local alignment, we establish a system of pointers. Tracing
                              these pointers back from the best pair (k, l), we either reach a boundary or
                              a pointer points to the 0 term in (13.10). In either case, we terminate the
                              optimal substrings on the left.
                                We can extend the Smith-Waterman algorithm to an affine gap score
                              h(k)= −α − (k − 1)β by introducing the maximum substring similarities

                              E kl , F kl , and G kl for substring alignments ending in  x k  ,  −  , and  x k  ,
                                                                               −   y l      y l
                              respectively. In this notation H kl = max{E kl ,F kl ,G kl }, and we have the
                              boundary conditions E kl = F kl = G kl = 0 if either k =0 or l = 0. The
                              corresponding recurrences are

                                     E kl  = max{0,E k−1,l − β, F k−1,l − α, G k−1,l − α}
                                      F kl  = max{0,E k,l−1 − α, F k,l−1 − β, G k,l−1 − α}
                                     G kl  = max{0,E k−1,l−1 + s, F k−1,l−1 + s, G k−1,l−1 + s},
                              where in the last equation we have abbreviated s = s(x k ,y l ).
                                Finally, to identify repeat units within a DNA sequence x, it is helpful to
                              compare x to itself. Since x matches itself perfectly, the best local alignment
                              is the trivial global alignment. To avoid drawing this conclusion in the
                              Smith-Waterman algorithm, we define the new boundary condition H ii =0
                              for all i and invoke the recurrence (13.10) only on the upper triangle of (H ij )
                              where i< j.


                              13.8 Multiple Sequence Comparisons


                              Alignment of three or more sequences is considerably harder than alignment
                              of two sequences [8, 12]. To see how the computational complexity of global
   299   300   301   302   303   304   305   306   307   308   309