Page 295 - Applied Probability

P. 295

13. Sequence Analysis
(m)
= R (m) in equation (13.1).
using the obvious fact R
One can test the Poisson hypothesis in various ways [12]. For instance,
the total number of clumps T should be approximately normal with mean
and variance nλ. If we divide the interval [0,n]into k equal subintervals
and let T i denote the number of clumps in the ith subinterval, then the
normal approximation to the Poisson suggests that 283
k 2
(T i − nλ/k)
S =
nλ/k
i=1
should be approximately chi-square with k degrees of freedom. If we esti-
mate λ by T/n, then the degrees of freedom drop by 1. As a rule of thumb,
k should satisfy T/k ≥ 5.
Alternatively, we can condition on T and check whether the T points are
uniformly distributed over [0,n]. The standard Kolmogorov-Smirnov test
assesses the largest deviation between the empirical distribution function
of the points and the uniform distribution on [0,n]. Finally, the interclump
distances are approximately exponentially distributed. This suggests test-
ing the smallest and largest interclump distances against their respective
exponential and extreme value distributions. Section 14.6 develops these
statistics and more elaborate test statistics.

13.3 Alphabets, Strings, and Alignments

How can we quantify the similarity of two ﬁnite strings from a com-
mon alphabet? One fruitful approach is to introduce a distance function
on strings. In mathematical analysis, distance is measured by a metric
d(x, y) ≥ 0 having the properties (a) d(x, y)= d(y, x), (b) d(x, y)= 0if
and only if x = y, and (c) d(x, z) ≤ d(x, y)+ d(y, z). Property (c) is called
the triangle inequality. We will assume that a metric d(x, y) is deﬁned on
the underlying alphabet; for many purposes, the trivial choice d(x, y)= 1
for x = y is adequate. Once we have a metric on the alphabet, then we
can deﬁne a metric called the Levenshtein metric on ﬁnite strings from
the alphabet. Before attempting to construct the Levenshtein metric, a few
motivating remarks will be helpful.
First, even in genetics more than one alphabet is of interest. The obvious
choice is the four letter alphabet A, C, T, and G. This reduces to a two letter
alphabet if we simply record whether a base is a purine or a pyrimidine.
A third alphabet is the 20 letter alphabet composed of the diﬀerent amino
acids.
Second, we need to exercise some care in deﬁning the term alignment.
Consider the two English words GENETICS and GENOTYPES and their

290 291 292 293 294 295 296 297 298 299 300