Page 84 - Applied Probability
4. Hypothesis Testing and Categorical Data
function that mutations in these amino acids are immediately eliminated
by evolution.
4.6 Exact Tests of Independence
The problem of testing linkage equilibrium is equivalent to a more gen-
eral statistical problem of testing for independence in contingency tables.
To translate into the usual statistical terminology, one need only equate
“locus” to “factor,” “allele” to “level,” and “linkage equilibrium” to “inde-
pendence.” In exact inference, one conditions on the marginal counts of a
contingency table. In the linkage equilibrium setting, this means condition-
ing on the allele counts at each locus. Suppose we sample n independent
haplotypes defined on $m$ loci. Recall that a haplotype $i = (i_1, \ldots, i_m)$ is
just an $m$-tuple of allele choices at the participating loci. If the frequency
of allele $k$ at locus $j$ is $p_{jk}$, then under linkage equilibrium the haplotype
$i = (i_1, \ldots, i_m)$ has probability
$$
p_i = \prod_{j=1}^{m} p_{j i_j},
$$
and the haplotype counts $\{n_i\}$ from the sample follow a multinomial
distribution with parameters $(n, \{p_i\})$. The marginal allele counts $\{n_{jk}\}$ at
any locus $j$ likewise follow a multinomial distribution with parameters
$(n, \{p_{jk}\})$. Since under the null hypothesis of linkage equilibrium, marginal
counts are independent from locus to locus, the conditional distribution of
the haplotype counts is
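$$
\Pr(\{n_i\} \mid \{n_{jk}\})
= \frac{\binom{n}{\{n_i\}} \prod_i p_i^{n_i}}
       {\prod_{j=1}^{m} \binom{n}{\{n_{jk}\}} \prod_k (p_{jk})^{n_{jk}}}
= \frac{\binom{n}{\{n_i\}}}{\prod_{j=1}^{m} \binom{n}{\{n_{jk}\}}}. \tag{4.4}
$$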
One of the pleasant facts of exact inference is that the multivariate
Fisher-Yates distribution (4.4) does not depend on the unknown allele
frequencies: because $\prod_i p_i^{n_i} = \prod_{j=1}^{m} \prod_k (p_{jk})^{n_{jk}}$,
they cancel from the ratio. Problem 8 indicates how to compute its moments [23].
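As a numerical check on (4.4), the following minimal Python sketch evaluates both the frequency-free form and the ratio of multinomial probabilities for a small two-locus sample. The function names (`multinomial`, `marginal_counts`, `fisher_yates_probability`, `ratio_form`), the allele labels, and the haplotype counts are all hypothetical choices made for illustration; they do not come from the text. Exact rational arithmetic is used so that the cancellation of the allele frequencies can be confirmed without rounding error.

```python
from fractions import Fraction
from math import factorial, prod


def multinomial(counts):
    """Multinomial coefficient (sum of counts)! / product of count!."""
    return factorial(sum(counts)) // prod(factorial(c) for c in counts)


def marginal_counts(hap_counts):
    """Allele counts n_jk at each locus j, tallied from haplotype counts n_i."""
    m = len(next(iter(hap_counts)))  # number of loci
    margins = [dict() for _ in range(m)]
    for hap, n_i in hap_counts.items():
        for j, allele in enumerate(hap):
            margins[j][allele] = margins[j].get(allele, 0) + n_i
    return margins


def fisher_yates_probability(hap_counts):
    """Second (frequency-free) form of (4.4)."""
    margins = marginal_counts(hap_counts)
    numerator = multinomial(list(hap_counts.values()))
    denominator = prod(multinomial(list(mj.values())) for mj in margins)
    return Fraction(numerator, denominator)


def ratio_form(hap_counts, freqs):
    """First form of (4.4): the multinomial probability of the haplotype
    counts divided by the product of the marginal multinomial probabilities.
    freqs[j][k] is the (assumed) frequency p_jk of allele k at locus j."""
    margins = marginal_counts(hap_counts)
    num = Fraction(multinomial(list(hap_counts.values())))
    for hap, n_i in hap_counts.items():
        p_i = prod(freqs[j][allele] for j, allele in enumerate(hap))
        num *= p_i ** n_i
    den = Fraction(1)
    for j, mj in enumerate(margins):
        den *= multinomial(list(mj.values()))
        for allele, n_jk in mj.items():
            den *= freqs[j][allele] ** n_jk
    return num / den


# Hypothetical sample: n = 7 haplotypes on m = 2 loci with two alleles each.
counts = {("A", "B"): 3, ("A", "b"): 1, ("a", "B"): 1, ("a", "b"): 2}

# Two arbitrary, different sets of allele frequencies.
freqs_1 = [{"A": Fraction(1, 2), "a": Fraction(1, 2)},
           {"B": Fraction(1, 3), "b": Fraction(2, 3)}]
freqs_2 = [{"A": Fraction(9, 10), "a": Fraction(1, 10)},
           {"B": Fraction(1, 4), "b": Fraction(3, 4)}]

print(fisher_yates_probability(counts))                        # 12/35
print(ratio_form(counts, freqs_1) == ratio_form(counts, freqs_2))  # True
```

For two loci with two alleles each, the value 12/35 agrees with the hypergeometric probability $\binom{4}{3}\binom{3}{1}/\binom{7}{4}$ of the corresponding 2 × 2 table, which is the classical Fisher exact test setting.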
We can also derive the Fisher-Yates distribution by a counting argument
involving a sample space distinct from the space of haplotype counts. Con-
sider an m × n matrix whose rows correspond to loci and whose columns
correspond to haplotypes. At locus $j$ there are $n$ genes, with $n_{jk}$ genes
representing allele $k$. If we uniquely label each of these $n$ genes, then there are
$n!$ distinguishable permutations of the genes in row $j$. The uniform sample
space consists of the $(n!)^m$ matrices derived from the $n!$ permutations of
each of the $m$ rows. Each such matrix is assigned probability $1/(n!)^m$. For