Page 79 - Applied Probability

4. Hypothesis Testing and Categorical Data
  Under the Hardy-Weinberg restrictions, Problem 2 of Chapter 2 shows
how to compute the maximum likelihood estimate $\hat{p}_b$ of the allele
frequency $p_b$. Testing Hardy-Weinberg equilibrium with the data
$f_B = 9032$, $f_b = 40$, $m_B = 8324$, and $m_b = 725$ requires computing
the approximate chi-square statistic
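$$
\begin{aligned}
\chi^2_1 &= 2 \ln \frac{\hat{q}_B^{f_B}\,\hat{q}_b^{f_b}\,\hat{r}_B^{m_B}\,\hat{r}_b^{m_b}}
                      {(1-\hat{p}_b^2)^{f_B}\,(\hat{p}_b^2)^{f_b}\,(1-\hat{p}_b)^{m_B}\,(\hat{p}_b)^{m_b}} \\
         &= 2 f_B \ln\frac{\hat{q}_B}{1-\hat{p}_b^2}
          + 2 f_b \ln\frac{\hat{q}_b}{\hat{p}_b^2}
          + 2 m_B \ln\frac{\hat{r}_B}{1-\hat{p}_b}
          + 2 m_b \ln\frac{\hat{r}_b}{\hat{p}_b} \\
         &= 2\,(14.115 - 12.081 - 26.144 + 26.669) \\
         &= 5.118.
\end{aligned}
$$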
This chi-square statistic has 2 − 1 = 1 degree of freedom, because the
unrestricted model has two free parameters ($q_b$ and $r_b$) and the
Hardy-Weinberg model has one ($p_b$). It is significant at the 0.025 level.
In fact, there are two different common forms of color blindness in humans.
A two-locus X-linked model does provide an adequate fit to these data.
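The arithmetic above can be checked numerically. The following is a minimal sketch, not the text's own procedure: the function name `hardy_weinberg_lrt` is hypothetical, and the closed form for $\hat{p}_b$ is an assumption obtained here by setting the derivative of the Hardy-Weinberg loglikelihood to zero, which gives the quadratic $(2f + m)p^2 + m_B p - (2f_b + m_b) = 0$ with $f = f_B + f_b$ and $m = m_B + m_b$. The sketch also assumes all four counts are positive.

```python
from math import erfc, log, sqrt


def hardy_weinberg_lrt(f_B, f_b, m_B, m_b):
    """Likelihood ratio test of Hardy-Weinberg equilibrium for an X-linked trait.

    f_B, f_b: counts of normal and affected females.
    m_B, m_b: counts of normal and affected males.
    Assumes every count is positive, so that no logarithm of zero occurs.
    """
    f = f_B + f_b
    m = m_B + m_b

    # Restricted MLE of p_b: positive root of (2f + m) p^2 + m_B p - (2 f_b + m_b) = 0.
    # (Derived for this sketch; an assumption, not quoted from the text.)
    a = 2 * f + m
    c = 2 * f_b + m_b
    p_b = (-m_B + sqrt(m_B**2 + 4 * a * c)) / (2 * a)

    # Unrestricted MLEs are the observed proportions within each sex.
    q_B, q_b = f_B / f, f_b / f
    r_B, r_b = m_B / m, m_b / m

    # The four loglikelihood-ratio contributions, in the order used in the text.
    terms = (
        f_B * log(q_B / (1 - p_b**2)),
        f_b * log(q_b / p_b**2),
        m_B * log(r_B / (1 - p_b)),
        m_b * log(r_b / p_b),
    )
    stat = 2 * sum(terms)

    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(max(stat, 0.0) / 2))
    return p_b, terms, stat, p_value


p_b, terms, stat, p_value = hardy_weinberg_lrt(9032, 40, 8324, 725)
print(p_b, terms, stat, p_value)
```

With the color blindness counts, the four entries of `terms` should agree with the bracketed values in the display up to rounding, and `p_value` should fall just under 0.025.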
4.3 Other Multinomial Problems in Genetics
Historically, chi-square tests have been the preferred method of testing
hypotheses about multinomial data with known probabilities per category.
Chi-square tests are appropriate when no clear alternative suggests itself.
However, in many genetics problems the most reasonable alternative is
some type of clustering of observations in one or a few categories. In such
situations, tests designed to detect excess counts in a few categories
should be used. Ewens et al. [12] highlight the $Z_{\max}$ test in an
application to in situ hybridization, a form of physical mapping of genes to
particular chromosome regions. This application is characterized by fairly
large observed counts in most categories and an excess count in a single
category. Other applications, such as measuring the nonrandomness of
chromosome breakpoints in cancer [9], involve lower counts per category and
excess counts in several categories.
  For relatively sparse multinomial data with known but unequal
probabilities per category, other statistics besides $Z_{\max}$ are useful.
For instance, the number of categories $W_d$ with $d$ or more observations
can be a sensitive indicator of clustering. Problems in detecting
nonrandomness in mutations in different proteins or in amino acids along a
single protein afford interesting opportunities for applying the $W_d$
statistic [17, 43]. When the variance and mean of $W_d$ are approximately
equal, then $W_d$ is approximately Poisson [4, 22]. In practice, this
asymptotic approximation should be checked by applying an exact numerical
algorithm for computing p-values.
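To make the mean-versus-variance comparison for $W_d$ concrete, here is a minimal sketch under stated assumptions. It computes the mean and variance of $W_d$ from the marginal binomial and pairwise trinomial probabilities of the multinomial counts, then evaluates the Poisson upper tail with the same mean. All function names and the example data are hypothetical, and this is not the exact p-value algorithm the text refers to; it only checks whether the Poisson approximation is plausible.

```python
from math import comb, exp


def prob_below(n, p, d):
    """P(N < d) for N ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1.0 - p)**(n - k)
               for k in range(min(d, n + 1)))


def prob_both_below(n, p_i, p_j, d):
    """P(N_i < d and N_j < d) for two cells of a multinomial with n trials."""
    rest = max(1.0 - p_i - p_j, 0.0)  # probability of all other cells
    total = 0.0
    for a in range(d):
        for b in range(d):
            if a + b > n:
                continue
            total += (comb(n, a) * comb(n - a, b)
                      * p_i**a * p_j**b * rest**(n - a - b))
    return total


def mean_and_variance_w_d(n, probs, d):
    """Mean and variance of W_d, the number of cells with d or more counts."""
    below = [prob_below(n, p, d) for p in probs]
    hit = [1.0 - b for b in below]  # P(N_i >= d)
    mean = sum(hit)
    var = sum(h * (1.0 - h) for h in hit)
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            # Inclusion-exclusion gives P(N_i >= d and N_j >= d).
            both = (1.0 - below[i] - below[j]
                    + prob_both_below(n, probs[i], probs[j], d))
            var += 2.0 * (both - hit[i] * hit[j])
    return mean, var


def poisson_upper_tail(lam, w):
    """P(X >= w) for X ~ Poisson(lam)."""
    term, cdf = exp(-lam), 0.0
    for k in range(w):
        cdf += term
        term *= lam / (k + 1)
    return 1.0 - cdf


# Hypothetical sparse data: 30 observations over 10 unequal categories.
probs = [0.20, 0.15, 0.15, 0.10, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05]
mean, var = mean_and_variance_w_d(30, probs, d=8)
print(mean, var, poisson_upper_tail(mean, 2))
```

If the printed mean and variance differ noticeably, the Poisson tail probability should not be trusted, and an exact or simulation-based p-value is the safer choice.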