Page 317 - Applied Probability
14. Poisson Approximation
approximation
$$
n \approx \frac{-\ln\left(\frac{1}{d} + 1\right) + \ln \ln 2}{\ln(1 - d)} \approx \frac{1}{d} \ln \frac{1}{d}.
$$
In fact, a detailed analysis shows that the average required number of
markers is asymptotically similar to $\frac{1}{d} \ln \frac{1}{d}$ for small d [8, 18]. The factor
$\ln \frac{1}{d}$ is the penalty exacted for randomly selecting markers.
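As a rough numerical check, the following Python sketch compares the two expressions for a few small values of d. It assumes the refined approximation reads n ≈ [−ln(1/d + 1) + ln ln 2] / ln(1 − d); the function names are illustrative and do not come from the text.

```python
import math


def markers_refined(d):
    """Assumed refined approximation: [-ln(1/d + 1) + ln ln 2] / ln(1 - d)."""
    numerator = -math.log(1.0 / d + 1.0) + math.log(math.log(2.0))
    return numerator / math.log(1.0 - d)


def markers_leading(d):
    """Leading-order term (1/d) ln(1/d)."""
    return math.log(1.0 / d) / d


if __name__ == "__main__":
    for d in (1e-2, 1e-3, 1e-4):
        refined = markers_refined(d)
        leading = markers_leading(d)
        print(f"d={d:g}  refined={refined:.1f}  leading={leading:.1f}  "
              f"ratio={refined / leading:.4f}")
```

The ratio of the two expressions approaches 1 only slowly as d shrinks, because the neglected terms are of lower order than ln(1/d) but not negligible at practical values of d.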
The tedium of filling the last few gaps also plagues other mapping
endeavors such as covering a chromosome by random clones of fixed length
d [20]. If we let the center of each clone correspond to a marker, then,
except for edge effects, this problem is completely analogous to the marker
coverage problem.
14.6 Randomness of Restriction Sites
Restriction enzymes are special bacterial proteins that snip DNA. The
restriction sites where the cutting takes place vary from enzyme to
enzyme. For instance, the restriction enzyme EcoRI recognizes the six-base
sequence GAATTC and snips DNA wherever this sequence appears. The
restriction enzyme NotI recognizes the rarer eight-base sequence GCGGCCGC
and consequently tends to produce much longer fragments on average
than EcoRI. To a good approximation, the restriction sites for a particular
enzyme occur along a chromosome according to a homogeneous Poisson
process. Clustering of restriction sites is a particularly interesting violation
of the Poisson process assumptions.
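To make the fragment-length comparison concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the four bases are independent and equally likely, so that a k-base recognition sequence begins at any given position with probability 4^(-k); real genomes deviate from this. The helper names are hypothetical.

```python
import random


def expected_site_spacing(site):
    """Approximate mean distance in bases between successive sites,
    assuming independent, equally likely bases (an illustrative assumption)."""
    return 4 ** len(site)


def find_sites(sequence, site):
    """Return the start positions of every occurrence of site in sequence."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    length = 1_000_000
    dna = "".join(rng.choices("ACGT", k=length))
    for name, site in (("EcoRI", "GAATTC"), ("NotI", "GCGGCCGC")):
        hits = find_sites(dna, site)
        spacing = expected_site_spacing(site)
        print(name, "expected spacing:", spacing,
              "sites found:", len(hits),
              "expected count:", round(length / spacing))
```

Under this assumption the six-base site recurs about every 4,096 bases and the eight-base site about every 65,536 bases, which is one way to see why NotI fragments tend to be much longer.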
If one visualizes n restriction sites along a stretch of DNA as random
points on the unit interval [0, 1], then under the Poisson process
assumption, the n points should constitute a random sample of size n from the
uniform distribution on [0, 1]. The distances between adjacent points are
known as spacings, or scans. An m-spacing is the distance between the
first and last point of m + 1 adjacent points. In Section 14.5, we
approximated the distribution of the largest 1-spacing. Here we are interested in
detecting clustering by examining the smallest m-spacing $S_m$ from a set
of n restriction sites. Values of m > 1 are important because very short
DNA fragments are difficult to measure exactly. The Chen-Stein method
provides a means of assessing the significance of an observed m-spacing
$S_m = s$ [5, 13].
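The definition of the smallest m-spacing can be sketched in a few lines of Python. The second function is a brute-force Monte Carlo estimate of the probability that the smallest m-spacing is at most s under the uniform null hypothesis; it is offered only as a simple cross-check and is not the Chen-Stein approximation discussed in the text. All names and parameter defaults are illustrative assumptions.

```python
import random


def smallest_m_spacing(points, m):
    """Smallest distance between the first and last of m + 1 adjacent points."""
    if m < 1:
        raise ValueError("m must be at least 1")
    ordered = sorted(points)
    if len(ordered) < m + 1:
        raise ValueError("need at least m + 1 points")
    return min(ordered[i + m] - ordered[i] for i in range(len(ordered) - m))


def monte_carlo_pvalue(n, m, s, trials=2000, seed=0):
    """Estimate P(smallest m-spacing <= s) for n uniform points on [0, 1]
    by simulation (a stand-in for illustration, not the Chen-Stein method)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        if smallest_m_spacing(sample, m) <= s:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    sites = [0.90, 0.10, 0.25, 0.70, 0.20]
    print("smallest 1-spacing:", smallest_m_spacing(sites, 1))
    print("smallest 2-spacing:", smallest_m_spacing(sites, 2))
    print("estimated P(S_2 <= 0.15), n = 5:", monte_carlo_pvalue(5, 2, 0.15))
```

A small estimated probability would suggest that the observed cluster is tighter than uniformly scattered sites typically produce.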
Consider the collection I of subsets α of size m + 1 from the set of n
random points on [0, 1]. Let $X_\alpha$ be the indicator random variable of the
event that the distance from the first point of α to the last point of α is
less than or equal to s. There are $|I| = \binom{n}{m+1}$ such collections α, and each