Page 39 - Applied Probability

2. Counting Methods and the EM Algorithm
estimate the frequency $p_M$ of the M allele, we count two M genes for each M phenotype and one M gene for each MN phenotype. Thus, our estimate of $p_M$ is $\hat{p}_M = \frac{2 \times 119 + 76}{2 \times 208} = .755$. Similarly, $\hat{p}_N = \frac{2 \times 13 + 76}{2 \times 208} = .245$. Note that $\hat{p}_M + \hat{p}_N = 1$.
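
The arithmetic above can be reproduced with a short Python sketch. The phenotype counts (119 M, 76 MN, 13 N) are read off the fractions in the text; the function name `gene_count` is a hypothetical helper introduced only for illustration.

```python
def gene_count(n_m, n_mn, n_n):
    """Gene-counting estimates at a codominant two-allele locus.

    Each person carries two genes, so the denominator is 2n.
    An M phenotype contributes two M genes, an MN phenotype one.
    """
    n = n_m + n_mn + n_n
    p_m = (2 * n_m + n_mn) / (2 * n)
    p_n = (2 * n_n + n_mn) / (2 * n)
    return p_m, p_n


# Counts implied by the fractions in the text: 119 M, 76 MN, 13 N (n = 208).
p_m, p_n = gene_count(119, 76, 13)
print(round(p_m, 3), round(p_n, 3))  # prints: 0.755 0.245
```

Because every gene is counted exactly once, the two estimates necessarily sum to 1.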

In general, at a locus with $k$ codominant alleles, suppose we count $n_i$ alleles of type $i$ in a random sample of $n$ unrelated people. Then the ratio $\hat{p}_i = \frac{n_i}{2n}$ provides a desirable estimate of the frequency $p_i$ of allele $i$. Since the counts $(n_1, \ldots, n_k)$ follow a multinomial distribution, the expectation $E(\hat{p}_i) = \frac{2np_i}{2n} = p_i$. In other words, $\hat{p}_i$ is an unbiased estimator. By the strong law of large numbers, $\hat{p}_i$ is also a strongly consistent estimator [6].
In passing, we also note the variance and covariance expressions

$$\operatorname{Var}(\hat{p}_i) = \frac{2np_i(1-p_i)}{(2n)^2} = \frac{p_i(1-p_i)}{2n}$$

$$\operatorname{Cov}(\hat{p}_i, \hat{p}_j) = -\frac{2np_ip_j}{(2n)^2} = -\frac{p_ip_j}{2n}.$$

Finally, as observed in Problem 3, the $\hat{p}_i$ constitute the maximum likelihood estimates of the $p_i$.
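
As a numerical check on the mean, variance, and covariance formulas, the following Python sketch enumerates a small multinomial distribution exactly rather than simulating it. It assumes the $2n$ sampled genes are multinomial with cell probabilities $p_i$, $p_j$, and $1 - p_i - p_j$ (all other alleles lumped together). The function name `exact_moments` and the values $2n = 20$, $p_i = 0.3$, $p_j = 0.2$ are illustrative choices, not taken from the text.

```python
from math import comb


def exact_moments(two_n, p_i, p_j):
    """Exact mean and variance of n_i/(2n), and covariance with n_j/(2n).

    Sums over every outcome (a, b, c) of a trinomial with two_n trials,
    where the third cell lumps together all remaining alleles.
    """
    p_rest = 1.0 - p_i - p_j
    e_i = e_j = e_ii = e_ij = 0.0
    for a in range(two_n + 1):
        for b in range(two_n - a + 1):
            c = two_n - a - b
            prob = (comb(two_n, a) * comb(two_n - a, b)
                    * p_i ** a * p_j ** b * p_rest ** c)
            x, y = a / two_n, b / two_n
            e_i += prob * x
            e_j += prob * y
            e_ii += prob * x * x
            e_ij += prob * x * y
    return e_i, e_ii - e_i ** 2, e_ij - e_i * e_j


two_n, p_i, p_j = 20, 0.3, 0.2  # illustrative values only
mean, var, cov = exact_moments(two_n, p_i, p_j)
print(mean, p_i)                       # unbiasedness: E(p_i hat) vs p_i
print(var, p_i * (1 - p_i) / two_n)    # variance vs p_i (1 - p_i) / (2n)
print(cov, -p_i * p_j / two_n)         # covariance vs -p_i p_j / (2n)
```

Each printed pair should agree up to floating-point rounding.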

This simple gene-counting argument encounters trouble if we consider a locus with recessive alleles because we can no longer infer genotypes from phenotypes. Consider the ABO locus, for instance. Suppose we observe $n_A$ people of type A, $n_B$ people of type B, $n_{AB}$ people of type AB, and $n_O$ people of type O. Let $n = n_A + n_B + n_{AB} + n_O$ be the total number of people in the random sample. If we want to estimate the frequency $p_A$ of the A allele, we cannot say exactly how many of the $n_A$ people are homozygotes A/A and how many are heterozygotes A/O. Thus, we are prevented from directly counting genes.

There is a way out of this dilemma that exploits Hardy-Weinberg equilibrium. If we knew the true allele frequencies $p_A$ and $p_O$, then we could correctly apportion the $n_A$ individuals of phenotype A. Genotype A/A has frequency $p_A^2$ in the population, while genotype A/O has frequency $2p_Ap_O$. Of the $n_A$ people of type A, we expect $n_{A/A} = n_A p_A^2/(p_A^2 + 2p_Ap_O)$ people to have genotype A/A and $n_{A/O} = n_A 2p_Ap_O/(p_A^2 + 2p_Ap_O)$ people to have genotype A/O. Employing circular reasoning, we now estimate $p_A$ by

$$\hat{p}_A = \frac{2n_{A/A} + n_{A/O} + n_{AB}}{2n}. \tag{2.1}$$

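The apportionment step and estimate (2.1) can be sketched in Python as follows. The function names and all numerical inputs (40 type A people, 10 type AB people, 100 people in total, and assumed frequencies $p_A = 0.3$, $p_O = 0.6$) are hypothetical values chosen for illustration; they are not data from the text.

```python
def apportion_type_a(n_a, p_a, p_o):
    """Split the n_A type A people into expected A/A and A/O counts.

    Under Hardy-Weinberg equilibrium, genotype A/A has frequency p_A^2
    and genotype A/O has frequency 2 p_A p_O.
    """
    hom = p_a ** 2
    het = 2 * p_a * p_o
    n_aa = n_a * hom / (hom + het)
    n_ao = n_a * het / (hom + het)
    return n_aa, n_ao


def estimate_p_a(n_a, n_ab, n, p_a, p_o):
    """Estimate (2.1): count two A genes per A/A, one per A/O and per AB."""
    n_aa, n_ao = apportion_type_a(n_a, p_a, p_o)
    return (2 * n_aa + n_ao + n_ab) / (2 * n)


# Hypothetical inputs, for illustration only.
n_aa, n_ao = apportion_type_a(n_a=40, p_a=0.3, p_o=0.6)
print(n_aa, n_ao)  # expected A/A and A/O counts; they sum to n_A
print(estimate_p_a(n_a=40, n_ab=10, n=100, p_a=0.3, p_o=0.6))
```

With these inputs the 40 type A people split into roughly 8 A/A and 32 A/O, giving an estimate of about 0.29. The estimate depends on the assumed $p_A$ and $p_O$, which is exactly the circularity noted above.
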
The trick now is to remove the circularity by iterating. Suppose we make an initial guess $p_{mA}$, $p_{mB}$, and $p_{mO}$ of the three allele frequencies at iteration 0. By analogy to the reasoning leading to (2.1), we attribute at