Page 79 - Applied Probability
4. Hypothesis Testing and Categorical Data
Under the Hardy-Weinberg restrictions, Problem 2 of Chapter 2 shows
how to compute the maximum likelihood estimate of the allele frequency
$p_b$. Testing Hardy-Weinberg equilibrium with the data $f_B = 9032$,
$f_b = 40$, $m_B = 8324$, and $m_b = 725$ requires computing the
approximate chi-square statistic
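$$
\begin{aligned}
\chi^2_1 &= 2 \ln \frac{\hat q_B^{f_B}\, \hat q_b^{f_b}\, \hat r_B^{m_B}\, \hat r_b^{m_b}}
{(1 - \hat p_b^2)^{f_B}\, (\hat p_b^2)^{f_b}\, (1 - \hat p_b)^{m_B}\, (\hat p_b)^{m_b}} \\
&= 2 f_B \ln \frac{\hat q_B}{1 - \hat p_b^2} + 2 f_b \ln \frac{\hat q_b}{\hat p_b^2}
 + 2 m_B \ln \frac{\hat r_B}{1 - \hat p_b} + 2 m_b \ln \frac{\hat r_b}{\hat p_b} \\
&= 2\,(14.115 - 12.081 - 26.144 + 26.669) \\
&= 5.118.
\end{aligned}
$$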
This chi-square statistic has 2 − 1 = 1 degree of freedom and is significant
at the .025 level. In fact, there are two different common forms of color
blindness in humans. A two-locus X-linked model does provide an adequate
fit to these data.
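
The arithmetic behind the statistic of 5.118 can be reproduced with a short script. The following is a minimal sketch rather than the book's own method. It assumes that $\hat q_B, \hat q_b$ and $\hat r_B, \hat r_b$ are the observed phenotype proportions in females and males, and it obtains $\hat p_b$ as the positive root of the quadratic score equation derived in the docstring (a derivation supplied here, not quoted from Problem 2 of Chapter 2). The function names `hw_mle_pb` and `lrt_statistic` are hypothetical.

```python
import math

# Counts quoted in the text: females (f) and males (m),
# normal vision (B) and color blind (b).
f_B, f_b, m_B, m_b = 9032, 40, 8324, 725


def hw_mle_pb(f_B, f_b, m_B, m_b):
    """MLE of p_b under Hardy-Weinberg equilibrium (assumed derivation).

    The log-likelihood is
        f_B*ln(1 - p^2) + f_b*ln(p^2) + m_B*ln(1 - p) + m_b*ln(p).
    Setting its derivative to zero gives the quadratic
        (2f + m) p^2 + m_B p - (2 f_b + m_b) = 0,
    with f = f_B + f_b and m = m_B + m_b; the positive root is returned.
    """
    a = 2 * (f_B + f_b) + (m_B + m_b)
    b = m_B
    c = -(2 * f_b + m_b)
    return (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)


def lrt_statistic(f_B, f_b, m_B, m_b):
    """Return (statistic, four log-ratio terms, p_hat)."""
    p = hw_mle_pb(f_B, f_b, m_B, m_b)
    f, m = f_B + f_b, m_B + m_b
    # Unrestricted estimates: observed phenotype proportions by sex.
    q_B, q_b = f_B / f, f_b / f
    r_B, r_b = m_B / m, m_b / m
    terms = [
        f_B * math.log(q_B / (1 - p * p)),
        f_b * math.log(q_b / (p * p)),
        m_B * math.log(r_B / (1 - p)),
        m_b * math.log(r_b / p),
    ]
    return 2 * sum(terms), terms, p


stat, terms, p_hat = lrt_statistic(f_B, f_b, m_B, m_b)

# Upper tail of a chi-square with 1 degree of freedom:
# P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))

print(f"p_hat = {p_hat:.5f}")
print("terms =", [round(t, 3) for t in terms])
print(f"statistic = {stat:.3f}, p-value = {p_value:.4f}")
```

The four entries of `terms` correspond, in order, to the four summands in the second line of the displayed statistic, so each can be compared against its rounded value in the text.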
4.3 Other Multinomial Problems in Genetics
Historically, chi-square tests have been the preferred method of testing
hypotheses about multinomial data with known probabilities per category.
Chi-square tests are appropriate when no clear alternative suggests itself.
However, in many genetics problems the most reasonable alternative is
some type of clustering of observations in one or a few categories. In such
situations, tests for detecting excess counts in a few categories should be
conducted. Ewens et al. [12] highlight the $Z_{\max}$ test in an application
to in situ hybridization, a form of physical mapping of genes to particular
chromosome regions. This application is characterized by fairly large
observed counts in most categories and an excess count in a single category.
Other applications, such as measuring the nonrandomness of chromosome
breakpoints in cancer [9], involve lower counts per category and excess
counts in several categories.
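
To make the idea of testing for an excess count in a single category concrete, here is a hedged sketch of a maximum standardized count. The text does not define the statistic of Ewens et al., so the form used here is an assumption: each count $N_i$ is standardized by its binomial mean $n p_i$ and variance $n p_i (1 - p_i)$, and the largest standardized value is reported. The p-value bound combines a union (Bonferroni) bound with a normal approximation, which is a stand-in chosen for illustration and not necessarily the procedure of [12]. The counts and probabilities are invented.

```python
import math


def z_scores(counts, probs):
    """Standardized counts (N_i - n p_i) / sqrt(n p_i (1 - p_i))."""
    n = sum(counts)
    return [(c - n * p) / math.sqrt(n * p * (1 - p))
            for c, p in zip(counts, probs)]


def normal_upper_tail(z):
    """P(Z >= z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# Hypothetical data: 10 equally likely regions, one with an excess count.
probs = [0.1] * 10
counts = [9, 11, 10, 8, 12, 10, 24, 7, 9, 10]

z = z_scores(counts, probs)
z_max = max(z)
hot = z.index(z_max)  # index of the category with the largest excess

# Union bound over categories, each tail approximated by the normal tail.
bound = min(1.0, len(z) * normal_upper_tail(z_max))

print(f"largest standardized count = {z_max:.3f} at category {hot}")
print(f"approximate upper bound on the p-value = {bound:.2e}")
```

The normal approximation is only reasonable when expected counts are fairly large, which matches the in situ hybridization setting described in the text and not the sparse setting.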
For relatively sparse multinomial data with known but unequal
probabilities per category, other statistics besides $Z_{\max}$ are useful.
For instance, the number of categories $W_d$ with $d$ or more observations
can be a sensitive indicator of clustering. Problems in detecting
nonrandomness in mutations in different proteins or in amino acids along a
single protein afford interesting opportunities for applying the $W_d$
statistic [17, 43]. When the variance and mean of $W_d$ are approximately
equal, $W_d$ is approximately Poisson [4, 22]. In practice, this asymptotic
approximation should be checked by applying an exact numerical algorithm
for computing p-values.
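
As an illustration of the $W_d$ statistic, the sketch below counts the categories with at least $d$ observations, computes the exact mean of $W_d$ as a sum of binomial tail probabilities (each multinomial count is marginally binomial), and evaluates a Poisson upper-tail p-value with that mean. The data, the choice $d = 3$, and all function names are hypothetical. The sketch does not compare the variance of $W_d$ with its mean, nor does it implement the exact algorithm the text recommends for checking the approximation.

```python
import math


def binom_upper_tail(n, p, d):
    """P(N >= d) for N ~ Binomial(n, p), via the exact lower sum."""
    below = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(d))
    return 1.0 - below


def w_stat(counts, d):
    """Number of categories with d or more observations."""
    return sum(1 for c in counts if c >= d)


def w_mean(n, probs, d):
    """E[W_d]: sum over categories of P(N_i >= d)."""
    return sum(binom_upper_tail(n, p, d) for p in probs)


def poisson_upper_tail(lam, w):
    """P(X >= w) for X ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(w))
    return 1.0 - below


# Hypothetical sparse data: 30 categories with unequal known probabilities.
probs = [0.05] * 10 + [0.025] * 20
counts = ([4, 1, 0, 1, 0, 2, 0, 1, 0, 1]
          + [0, 0, 3, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1])
n = sum(counts)
d = 3

w = w_stat(counts, d)
lam = w_mean(n, probs, d)
p_value = poisson_upper_tail(lam, w)

print(f"n = {n}, W_{d} = {w}, E[W_{d}] = {lam:.4f}")
print(f"Poisson approximation to P(W_{d} >= {w}) = {p_value:.4f}")
```

Because the Poisson p-value is only an approximation, a small value here is a prompt for an exact or simulated check, not a final answer.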