Page 86 - Applied Probability

                                                      4. Hypothesis Testing and Categorical Data
                                Iterating this permutation procedure r times generates an independent,
                              random sample $Z_1, \ldots, Z_r$ from the Fisher-Yates distribution. In practice,
                              it suffices to permute all rows except the bottom row m because haplotype
                              counts do not depend on the order of the haplotypes in a haplotype matrix
                              such as (4.5). Given the observed value $T_{\text{obs}}$ of a test statistic $T$ for linkage
                              equilibrium, we estimate the corresponding p-value by the sample average
                              $$\frac{1}{r} \sum_{l=1}^{r} 1_{\{T(Z_l) \ge T_{\text{obs}}\}}.$$
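
The sampling scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the text's own implementation: it assumes the haplotype matrix is stored as a list of rows, one row per locus and one column per haplotype, and the names `haplotype_counts` and `permutation_p_value` are hypothetical. The caller supplies the test statistic as a function of the matrix.

```python
import random
from collections import Counter


def haplotype_counts(hap):
    """Count each distinct haplotype (column) of a matrix stored as rows of loci."""
    return Counter(zip(*hap))


def permutation_p_value(hap, statistic, r, seed=None):
    """Estimate a p-value as the fraction of r permuted matrices whose
    statistic is at least as large as the observed statistic."""
    rng = random.Random(seed)
    t_obs = statistic(hap)
    rows = [list(row) for row in hap]  # work on a copy; the input is left untouched
    hits = 0
    for _ in range(r):
        # Permute every row except the bottom one. Haplotype counts do not
        # depend on the order of the columns, so the last row can stay fixed.
        for row in rows[:-1]:
            rng.shuffle(row)
        if statistic(rows) >= t_obs:
            hits += 1
    return hits / r


# Illustrative statistic (an assumption, not from the text):
# the count of the most frequent haplotype.
def largest_count(hap):
    return max(haplotype_counts(hap).values())


toy = [[1, 1, 2, 2],
       [1, 1, 2, 2]]
print(permutation_p_value(toy, largest_count, r=1000, seed=0))
```

Each pass reshuffles the rows left over from the previous pass. Because every shuffle is a fresh uniform permutation, the successive matrices are still independent draws.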
                                In Fisher’s exact test, the statistic T is the negative of the Fisher-Yates
                              probability (4.4). Thus, the null hypothesis of linkage equilibrium (inde-
                              pendence) is rejected if the observed Fisher-Yates probability is too low.
                              The chi-square statistic $\sum_i [n_i - E(n_i)]^2 / E(n_i)$ is also reasonable for testing
                              independence, provided we estimate its p-value by random sampling and do
                              not foolishly rely on the standard chi-square approximation. As noted in
                              Problem 8, the expectation $E(n_i) = n \prod_{j=1}^{m} (n_{j i_j}/n)$.
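
As a hedged sketch of how the chi-square statistic and the expectation formula fit together, the following Python function computes the statistic for a haplotype matrix. It assumes the matrix is a list of rows (one per locus, one column per haplotype), reads $n_{j i_j}$ as the number of haplotypes carrying allele $i_j$ at locus $j$, and sums over every combination of the alleles actually observed at each locus, including haplotypes with zero count. The name `chi_square` is hypothetical.

```python
from collections import Counter
from itertools import product
from math import prod


def chi_square(hap):
    """Chi-square statistic for independence of the loci in a haplotype matrix."""
    m = len(hap)                                    # number of loci
    n = len(hap[0])                                 # number of haplotypes
    allele_counts = [Counter(row) for row in hap]   # allele counts at each locus
    observed = Counter(zip(*hap))                   # n_i for each observed haplotype i
    total = 0.0
    # Visit every possible haplotype i = (i_1, ..., i_m), observed or not.
    for cell in product(*(list(c) for c in allele_counts)):
        # E(n_i) = n * prod_j (n_{j i_j} / n)
        expected = n * prod(allele_counts[j][cell[j]] / n for j in range(m))
        total += (observed[cell] - expected) ** 2 / expected
    return total


independent = [[1, 1, 2, 2],
               [1, 2, 1, 2]]   # every haplotype appears once
associated = [[1, 1, 2, 2],
              [1, 1, 2, 2]]    # alleles at the two loci always match
print(chi_square(independent), chi_square(associated))
```

In both toy matrices every expected count equals 1, so the first matrix gives a statistic of 0 and the second a statistic of 4. A function of this kind can serve as the statistic $T$ in a random permutation test.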
                              Example 4.6.1 Chromosome-11 Haplotype Data
                                Weir and Brooks [45] construct 184 haplotypes on 8 chromosome-11
                              markers from phenotype data on 24 Utah pedigrees. Omitting the two
                              markers BEGl-Hind3 and ADJ-BCl and the two individuals 1353-8600 and
                              1355-8516 due to incomplete typing, we wind up with 180 full haplotypes
                              on 6 pertinent markers. These markers possess 2, 2, 10, 5, 3, and 2 alleles,
                              respectively. The data can be summarized in a six-dimensional contingency
                              table by giving the counts $n_i$ for each possible haplotype $i = (i_1, \ldots, i_6)$.
                              Since there are 2 × 2 × 10 × 5 × 3 × 2 = 1,200 haplotypes in all, the table
                              is very sparse, and large sample methods of testing linkage equilibrium are
                              suspect. The chi-square statistic $\chi^2 = \sum_i [n_i - E(n_i)]^2 / E(n_i)$ has an observed value
                              of 1,517 for these data. This corresponds to a large sample p-value of es-
                              sentially 0. On the other hand, the empirical p-value calculated from 3,999
                              independent samples of the $\chi^2$ statistic is .1332 ± .0057 [24]. Although the
                              grossly misleading large sample result is hardly surprising in this extreme
                              case, it does remind us of the limitations of large sample approximations
                              and the remedies offered by modern computing.
                                Readers should be aware that there are other methods for calculating p-
                              values associated with exact tests on contingency tables. Agresti [1] surveys
                              the deterministic algorithms useful on small to intermediate-sized tables.
                              For the large, sparse tables encountered in testing Hardy-Weinberg and
                              linkage equilibrium, Markov chain Monte Carlo methods can be even faster
                              than the random permutation method described above [16, 24].


                              4.7 Case-Control Association Tests


                              With little change, the same analysis applies to case-control association
                              studies. In this setting two factors appear, disease status and genotype.