Page 39 - Applied Probability
2. Counting Methods and the EM Algorithm
estimate the frequency $p_M$ of the M allele, we count two M genes for each
M phenotype and one M gene for each MN phenotype. Thus, our estimate
of $p_M$ is $\hat{p}_M = \frac{2 \times 119 + 76}{2 \times 208} = .755$. Similarly,
$\hat{p}_N = \frac{2 \times 13 + 76}{2 \times 208} = .245$. Note that
$\hat{p}_M + \hat{p}_N = 1$.
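The arithmetic above can be checked with a minimal Python sketch. The phenotype counts (119 M, 76 MN, 13 N, for 208 people) are read off the fractions in the text; the function name `gene_count_mn` is a hypothetical helper introduced only for illustration.

```python
def gene_count_mn(n_m, n_mn, n_n):
    """Estimate the M and N allele frequencies by direct gene counting."""
    total_genes = 2 * (n_m + n_mn + n_n)   # each person carries two genes
    p_m = (2 * n_m + n_mn) / total_genes   # two M genes per M, one per MN
    p_n = (2 * n_n + n_mn) / total_genes   # two N genes per N, one per MN
    return p_m, p_n


p_m, p_n = gene_count_mn(n_m=119, n_mn=76, n_n=13)
print(round(p_m, 3), round(p_n, 3))  # prints 0.755 0.245
```

Because every gene is either M or N, the two estimates necessarily sum to 1.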
In general, at a locus with $k$ codominant alleles, suppose we count $n_i$
alleles of type $i$ in a random sample of $n$ unrelated people. Then the ratio
$\hat{p}_i = \frac{n_i}{2n}$ provides a desirable estimate of the frequency $p_i$ of allele $i$. Since
the counts $(n_1, \ldots, n_k)$ follow a multinomial distribution, the expectation
$E(\hat{p}_i) = \frac{2n p_i}{2n} = p_i$. In other words, $\hat{p}_i$ is an unbiased estimator. By the
strong law of large numbers, $\hat{p}_i$ is also a strongly consistent estimator [6].
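Unbiasedness can be illustrated numerically. The sketch below is a simulation under stated assumptions, not part of the original argument: the allele frequencies in `true_p`, the sample size, the number of replicates, and the helper name `simulate_mean_estimate` are all hypothetical. It draws the $2n$ genes independently, so the allele counts are multinomial as assumed in the text, and averages the gene-counting estimates over many samples.

```python
import random


def simulate_mean_estimate(p, n, replicates=2000, seed=0):
    """Average the gene-counting estimates over many simulated samples."""
    rng = random.Random(seed)
    k = len(p)
    totals = [0.0] * k
    for _ in range(replicates):
        # Draw the 2n genes carried by n people; the counts are multinomial.
        genes = rng.choices(range(k), weights=p, k=2 * n)
        for i in range(k):
            totals[i] += genes.count(i) / (2 * n)
    return [t / replicates for t in totals]


true_p = [0.5, 0.3, 0.2]  # hypothetical allele frequencies
print(simulate_mean_estimate(true_p, n=200))
```

The printed averages should lie close to `true_p`, and they move closer as `replicates` grows.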
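In passing, we also note the variance and covariance expressions
\[
\begin{aligned}
\mathrm{Var}(\hat{p}_i) &= \frac{2n p_i (1 - p_i)}{(2n)^2} = \frac{p_i (1 - p_i)}{2n} \\
\mathrm{Cov}(\hat{p}_i, \hat{p}_j) &= -\frac{2n p_i p_j}{(2n)^2} = -\frac{p_i p_j}{2n}.
\end{aligned}
\]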
Finally, as observed in Problem 3, the $\hat{p}_i$ constitute the maximum likelihood
estimates of the $p_i$.
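The variance and covariance expressions above can be tabulated as a matrix. The following is a minimal sketch; the function name `allele_freq_covariance` and the frequencies passed to it are hypothetical choices made only for illustration.

```python
def allele_freq_covariance(p, n):
    """Covariance matrix of the gene-counting estimates from n people."""
    k = len(p)
    genes = 2 * n  # a sample of n people carries 2n genes
    return [
        [
            p[i] * (1 - p[i]) / genes if i == j else -p[i] * p[j] / genes
            for j in range(k)
        ]
        for i in range(k)
    ]


# Hypothetical frequencies for a three-allele codominant locus.
cov = allele_freq_covariance([0.5, 0.3, 0.2], n=100)
for row in cov:
    print(row)
```

Each row of the matrix sums to zero, up to rounding, which reflects the constraint that the estimates always add up to 1.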
This simple gene-counting argument encounters trouble if we consider a
locus with recessive alleles because we can no longer infer genotypes from
phenotypes. Consider the ABO locus, for instance. Suppose we observe $n_A$
people of type A, $n_B$ people of type B, $n_{AB}$ people of type AB, and $n_O$
people of type O. Let $n = n_A + n_B + n_{AB} + n_O$ be the total number of people
in the random sample. If we want to estimate the frequency $p_A$ of the A
allele, we cannot say exactly how many of the $n_A$ people are homozygotes
A/A and how many are heterozygotes A/O. Thus, we are prevented from
directly counting genes.
There is a way out of this dilemma that exploits Hardy-Weinberg equilibrium.
If we knew the true allele frequencies $p_A$ and $p_O$, then we could
correctly apportion the $n_A$ individuals of phenotype A. Genotype A/A
has frequency $p_A^2$ in the population, while genotype A/O has frequency
$2 p_A p_O$. Of the $n_A$ people of type A, we expect
$n_{A/A} = n_A p_A^2 / (p_A^2 + 2 p_A p_O)$
people to have genotype A/A and
$n_{A/O} = n_A 2 p_A p_O / (p_A^2 + 2 p_A p_O)$ people
to have genotype A/O. Employing circular reasoning, we now estimate $p_A$
by
\[
\hat{p}_A = \frac{2 n_{A/A} + n_{A/O} + n_{AB}}{2n}. \tag{2.1}
\]
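A single application of (2.1) can be sketched in Python as follows. The phenotype counts and the starting frequencies are illustrative values that do not come from this passage, and `update_p_a` is a hypothetical helper name. The sketch covers only the A allele, as in the text.

```python
def update_p_a(n_a, n_ab, n, p_a, p_o):
    """Apply equation (2.1) once, using current guesses for p_A and p_O."""
    denom = p_a**2 + 2 * p_a * p_o       # Hardy-Weinberg frequency of phenotype A
    n_aa = n_a * p_a**2 / denom          # expected number of A/A homozygotes
    n_ao = n_a * 2 * p_a * p_o / denom   # expected number of A/O heterozygotes
    new_p_a = (2 * n_aa + n_ao + n_ab) / (2 * n)
    return new_p_a, n_aa, n_ao


# Illustrative counts (assumed): 186 type A and 13 type AB among 521 people.
new_p_a, n_aa, n_ao = update_p_a(n_a=186, n_ab=13, n=521, p_a=0.3, p_o=0.5)
print(round(n_aa, 2), round(n_ao, 2), round(new_p_a, 4))
```

The two expected genotype counts always add back to `n_a`, so the step only redistributes the type A people between A/A and A/O before counting genes.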
The trick now is to remove the circularity by iterating. Suppose we make
an initial guess $p_{mA}$, $p_{mB}$, and $p_{mO}$ of the three allele frequencies at
iteration $m = 0$. By analogy to the reasoning leading to (2.1), we attribute at