Page 295 - Applied Probability
P. 295

13. Sequence Analysis
                                                    (m)
                                                       = R (m) in equation (13.1).
                              using the obvious fact R
                                One can test the Poisson hypothesis in various ways [12]. For instance,
                              the total number of clumps T should be approximately normal with mean
                              and variance nλ. If we divide the interval [0,n]into k equal subintervals
                              and let T i denote the number of clumps in the ith subinterval, then the
                              normal approximation to the Poisson suggests that             283
                                                            k           2
                                                           	  (T i − nλ/k)
                                                    S  =
                                                                  nλ/k
                                                           i=1
                              should be approximately chi-square with k degrees of freedom. If we esti-
                              mate λ by T/n, then the degrees of freedom drop by 1. As a rule of thumb,
                              k should satisfy T/k ≥ 5.
                                Alternatively, we can condition on T and check whether the T points are
                              uniformly distributed over [0,n]. The standard Kolmogorov-Smirnov test
                              assesses the largest deviation between the empirical distribution function
                              of the points and the uniform distribution on [0,n]. Finally, the interclump
                              distances are approximately exponentially distributed. This suggests test-
                              ing the smallest and largest interclump distances against their respective
                              exponential and extreme value distributions. Section 14.6 develops these
                              statistics and more elaborate test statistics.



                              13.3 Alphabets, Strings, and Alignments


                              How can we quantify the similarity of two finite strings from a com-
                              mon alphabet? One fruitful approach is to introduce a distance function
                              on strings. In mathematical analysis, distance is measured by a metric
                              d(x, y) ≥ 0 having the properties (a) d(x, y)= d(y, x), (b) d(x, y)= 0if
                              and only if x = y, and (c) d(x, z) ≤ d(x, y)+ d(y, z). Property (c) is called
                              the triangle inequality. We will assume that a metric d(x, y) is defined on
                              the underlying alphabet; for many purposes, the trivial choice d(x, y)= 1
                              for x  = y is adequate. Once we have a metric on the alphabet, then we
                              can define a metric called the Levenshtein metric on finite strings from
                              the alphabet. Before attempting to construct the Levenshtein metric, a few
                              motivating remarks will be helpful.
                                First, even in genetics more than one alphabet is of interest. The obvious
                              choice is the four letter alphabet A, C, T, and G. This reduces to a two letter
                              alphabet if we simply record whether a base is a purine or a pyrimidine.
                              A third alphabet is the 20 letter alphabet composed of the different amino
                              acids.
                                Second, we need to exercise some care in defining the term alignment.
                              Consider the two English words GENETICS and GENOTYPES and their
   290   291   292   293   294   295   296   297   298   299   300