Page 86 - Applied Probability
4. Hypothesis Testing and Categorical Data
Iterating this permutation procedure r times generates an independent,
random sample $Z_1, \ldots, Z_r$ from the Fisher-Yates distribution. In practice,
it suffices to permute all rows except the bottom row m because haplotype
counts do not depend on the order of the haplotypes in a haplotype matrix
such as (4.5). Given the observed value $T_{\mathrm{obs}}$ of a test statistic $T$ for linkage
equilibrium, we estimate the corresponding p-value by the sample average
$$\frac{1}{r} \sum_{l=1}^{r} 1_{\{T(Z_l) \ge T_{\mathrm{obs}}\}}.$$
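The sampling scheme and the sample average can be sketched in Python. This is only an illustration under stated assumptions, not the implementation behind the text: the haplotype matrix is taken to be a list of m rows (one per locus) with one column per haplotype, which is our reading of the layout of (4.5), and the names permute_rows and monte_carlo_p_value are hypothetical.

```python
import random


def permute_rows(haplotypes, rng):
    """Return a copy in which every row except the last is shuffled."""
    permuted = [list(row) for row in haplotypes]
    for row in permuted[:-1]:  # the bottom row m may stay fixed
        rng.shuffle(row)
    return permuted


def monte_carlo_p_value(haplotypes, statistic, r, seed=0):
    """Fraction of r permuted samples Z_l with T(Z_l) >= T_obs."""
    rng = random.Random(seed)
    t_obs = statistic(haplotypes)
    hits = sum(
        statistic(permute_rows(haplotypes, rng)) >= t_obs for _ in range(r)
    )
    return hits / r
```

Any function of the haplotype matrix can be passed as statistic; large values are treated as evidence against linkage equilibrium, matching the inequality in the sample average.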
In Fisher’s exact test, the statistic T is the negative of the Fisher-Yates
probability (4.4). Thus, the null hypothesis of linkage equilibrium (independence)
is rejected if the observed Fisher-Yates probability is too low.
The chi-square statistic $\sum_i [n_i - E(n_i)]^2 / E(n_i)$ is also reasonable for testing
independence, provided we estimate its p-value by random sampling and do
not foolishly rely on the standard chi-square approximation. As noted in
Problem 8, the expectation $E(n_i) = n \prod_{j=1}^{m} (n_{j i_j}/n)$.
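As a sketch of how the chi-square statistic might be evaluated, the function below tallies the haplotype counts n_i and the allele counts at each locus, then forms each expectation E(n_i) as n times the product of the observed allele frequencies. It assumes a haplotype matrix stored as a list of rows, one row per locus and one column per haplotype, and it sums over every haplotype that can be built from the observed alleles, including those with zero count. The name chi_square_statistic is hypothetical.

```python
from collections import Counter
from itertools import product


def chi_square_statistic(haplotypes):
    """Sum over all haplotypes i of [n_i - E(n_i)]^2 / E(n_i)."""
    n = len(haplotypes[0])                    # number of haplotypes (columns)
    allele_counts = [Counter(row) for row in haplotypes]  # counts per locus
    observed = Counter(zip(*haplotypes))      # n_i for each observed haplotype
    total = 0.0
    for i in product(*(sorted(c) for c in allele_counts)):
        expected = float(n)                   # E(n_i) = n * prod of frequencies
        for j, allele in enumerate(i):
            expected *= allele_counts[j][allele] / n
        total += (observed[i] - expected) ** 2 / expected
    return total
```

For two biallelic loci with haplotypes 11, 12, 21, 22 each seen once, every expected count is 1 and the statistic is 0; with haplotypes 11, 11, 22, 22 the statistic is 4.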
Example 4.6.1 Chromosome-11 Haplotype Data
Weir and Brooks [45] construct 184 haplotypes on 8 chromosome-11
markers from phenotype data on 24 Utah pedigrees. Omitting the two
markers BEGl-Hind3 and ADJ-BCl and the two individuals 1353-8600 and
1355-8516 due to incomplete typing, we wind up with 180 full haplotypes
on 6 pertinent markers. These markers possess 2, 2, 10, 5, 3, and 2 alleles,
respectively. The data can be summarized in a six-dimensional contingency
table by giving the counts $n_i$ for each possible haplotype $i = (i_1, \ldots, i_6)$.
Since there are 2 × 2 × 10 × 5 × 3 × 2 = 1,200 haplotypes in all, the table
is very sparse, and large sample methods of testing linkage equilibrium are
suspect. The chi-square statistic $\chi^2 = \sum_i [n_i - E(n_i)]^2 / E(n_i)$ has an observed value
of 1,517 for these data. This corresponds to a large sample p-value of essentially
0. On the other hand, the empirical p-value calculated from 3,999
independent samples of the $\chi^2$ statistic is .1332 ± .0057 [24]. Although the
grossly misleading large sample result is hardly surprising in this extreme
case, it does remind us of the limitations of large sample approximations
and the remedies offered by modern computing.
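To make the sparsity of this table concrete, the short calculation below uses only the figures quoted in the example. It shows an average of 0.15 haplotypes per cell, far below the counts of about 5 per cell that common rules of thumb ask for before trusting the chi-square approximation. The variable names are illustrative.

```python
from math import prod

alleles_per_marker = [2, 2, 10, 5, 3, 2]  # allele numbers quoted in the example
n_haplotypes = 180                         # full haplotypes retained

n_cells = prod(alleles_per_marker)         # possible haplotypes (table cells)
mean_count = n_haplotypes / n_cells        # average count per cell
print(n_cells, mean_count)                 # prints: 1200 0.15
```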
Readers should be aware that there are other methods for calculating p-values
associated with exact tests on contingency tables. Agresti [1] surveys
the deterministic algorithms useful on small to intermediate-sized tables.
For the large, sparse tables encountered in testing Hardy-Weinberg and
linkage equilibrium, Markov chain Monte Carlo methods can be even faster
than the random permutation method described above [16, 24].
4.7 Case-Control Association Tests
With little change, the same analysis applies to case-control association
studies. In this setting two factors appear, disease status and genotype.