Page 317 - Applied Probability

14. Poisson Approximation
approximation

$$
n \approx \frac{-\ln\left(\frac{1}{d} + 1\right) + \ln\ln 2}{\ln(1 - d)} \approx \frac{1}{d}\ln\frac{1}{d}.
$$

In fact, a detailed analysis shows that the average required number of
markers is asymptotically similar to $\frac{1}{d}\ln\frac{1}{d}$ for d small [8, 18]. The factor
$\ln\frac{1}{d}$ is the penalty exacted for randomly selecting markers.
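
As a rough numerical check, the sketch below evaluates the crude approximation n ≈ [−ln(1/d + 1) + ln ln 2] / ln(1 − d) alongside the leading term (1/d) ln(1/d) for a few gap sizes d. The function names are illustrative and do not come from the text, and the crude formula is taken as read from the display, so the output is a sanity check rather than a derivation.

```python
import math


def crude_marker_count(d: float) -> float:
    """Crude approximation n ~ [-ln(1/d + 1) + ln ln 2] / ln(1 - d)."""
    if not 0.0 < d < 1.0:
        raise ValueError("d must lie strictly between 0 and 1")
    numerator = -math.log(1.0 / d + 1.0) + math.log(math.log(2.0))
    return numerator / math.log(1.0 - d)


def leading_term(d: float) -> float:
    """Asymptotic form (1/d) ln(1/d) for small d."""
    if not 0.0 < d < 1.0:
        raise ValueError("d must lie strictly between 0 and 1")
    return math.log(1.0 / d) / d


if __name__ == "__main__":
    # Compare the two expressions for progressively smaller gaps.
    for d in (0.1, 0.01, 0.001):
        print(f"d={d:<6} crude={crude_marker_count(d):10.1f} "
              f"leading={leading_term(d):10.1f}")
```

For the values of d tried here the two expressions differ by several percent, and the relative difference shrinks as d decreases.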
The tedium of filling the last few gaps also plagues other mapping
endeavors such as covering a chromosome by random clones of fixed length
d [20]. If we let the center of each clone correspond to a marker, then
except for edge effects, this problem is completely analogous to the marker
coverage problem.
14.6 Randomness of Restriction Sites
Restriction enzymes are special bacterial proteins that snip DNA. The
restriction sites where the cutting takes place vary from enzyme to enzyme.
For instance, the restriction enzyme EcoRI recognizes the six-base
sequence GAATTC and snips DNA wherever this sequence appears. The
restriction enzyme NotI recognizes the rarer eight-base sequence GCGGCCGC
and consequently tends to produce much longer fragments on average
than EcoRI. To a good approximation, the restriction sites for a particular
enzyme occur along a chromosome according to a homogeneous Poisson
process. Clustering of restriction sites is a particularly interesting violation
of the Poisson process assumptions.
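
To make the EcoRI and NotI comparison concrete, here is a minimal sketch that locates recognition sites in a sequence and computes the expected spacing between sites. It assumes, purely for illustration, that bases are independent and equally likely, so a k-base site starts at a given position with probability 4^(-k); real genomes depart from this. The helper names `find_sites` and `expected_spacing` are hypothetical.

```python
import random


def find_sites(sequence: str, site: str) -> list[int]:
    """Return the 0-based start positions of every occurrence of `site`."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Advance by one so overlapping occurrences are not skipped.
        start = sequence.find(site, start + 1)
    return positions


def expected_spacing(site: str) -> int:
    """Long-run mean distance between sites under i.i.d. uniform bases
    (an illustrative assumption, not a claim about real DNA)."""
    return 4 ** len(site)


if __name__ == "__main__":
    rng = random.Random(0)
    dna = "".join(rng.choice("ACGT") for _ in range(1_000_000))
    for name, site in (("EcoRI", "GAATTC"), ("NotI", "GCGGCCGC")):
        hits = find_sites(dna, site)
        print(name, "sites found:", len(hits),
              "expected spacing:", expected_spacing(site))
```

Under this equal-frequency assumption the expected spacing is 4^6 = 4096 bases for the six-base site and 4^8 = 65536 bases for the eight-base site, which is consistent with the longer NotI fragments described above.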
If one visualizes n restriction sites along a stretch of DNA as random
points on the unit interval [0, 1], then under the Poisson process assumption,
the n points should constitute a random sample of size n from the
uniform distribution on [0, 1]. The distances between adjacent points are
known as spacings, or scans. An m-spacing is the distance between the
first and last point of m + 1 adjacent points. In Section 14.5, we approximated
the distribution of the largest 1-spacing. Here we are interested in
detecting clustering by examining the smallest m-spacing $S_m$ from a set
of n restriction sites. Values of m > 1 are important because very short
DNA fragments are difficult to measure exactly. The Chen-Stein method
provides a means of assessing the significance of an observed m-spacing
$S_m = s$ [5, 13].
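
The definition of the smallest m-spacing translates directly into code. The sketch below computes the smallest m-spacing from a list of points and, as a simple stand-in, estimates the probability that it is at most s by Monte Carlo simulation under the uniform null. This simulation is not the Chen-Stein method mentioned above; it is only a brute-force check. The function names, trial count, and seed are illustrative choices.

```python
import random


def smallest_m_spacing(points: list[float], m: int) -> float:
    """Smallest distance between the first and last of m + 1 adjacent points."""
    if m < 1 or len(points) < m + 1:
        raise ValueError("need m >= 1 and at least m + 1 points")
    x = sorted(points)
    return min(x[i + m] - x[i] for i in range(len(x) - m))


def monte_carlo_p_value(n: int, m: int, s: float,
                        trials: int = 20_000, seed: int = 0) -> float:
    """Estimate Pr(S_m <= s) for n uniform points on [0, 1] by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        if smallest_m_spacing(sample, m) <= s:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    sites = [0.10, 0.15, 0.40, 0.42, 0.90]
    print("S_1 =", smallest_m_spacing(sites, 1))
    print("S_2 =", smallest_m_spacing(sites, 2))
    print("Pr(S_2 <= 0.01) ~", monte_carlo_p_value(n=50, m=2, s=0.01))
```

For m = 1 the simulated value can be compared with the exact probability 1 − (1 − (n − 1)s)^n, valid when (n − 1)s < 1, which is a standard result for the minimum spacing of n uniform points.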
Consider the collection I of subsets α of size m + 1 from the set of n
random points on [0, 1]. Let $X_\alpha$ be the indicator random variable of the
event that the distance from the first point of α to the last point of α is
less than or equal to s. There are $|I| = \binom{n}{m+1}$ such collections α, and each
