Page 49 - Applied Probability

2. Counting Methods and the EM Algorithm
of the observed data is
\[
g(Y \mid p) = \prod_{i=1}^{u} \frac{1}{s-d+1} \sum_{j=1}^{s-d+1} \prod_{k \in I_j} p(k-j+1, y_{ik}) \prod_{k \notin I_j} p(0, y_{ik}).
\]
Here $p(0, b)$ is the probability that any site outside the binding domain is occupied by base b, and $p(m, b)$, $1 \le m \le d$, is the probability that position m within the binding domain is occupied by base b. Each base is assigned independently according to these probabilities, and each domain is assigned an initial site independently and uniformly from the set $\{1, \ldots, s-d+1\}$.
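The generative model just described, together with the observed-data likelihood above, can be sketched in code. This is a minimal simulation, assuming a 4-letter DNA alphabet encoded 0–3; the values of u, s, d, the probability tables, the seed, and all function names are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
u, s, d = 5, 12, 3                               # segments, segment length, domain length

# p0[b] = background probability p(0, b); pm[m-1, b] = domain probability p(m, b).
p0 = np.array([0.25, 0.25, 0.25, 0.25])
pm = rng.dirichlet(np.ones(4), size=d)           # one distribution per domain position

def simulate_segment():
    """Draw one segment: pick a start site uniformly, then draw bases independently."""
    j = rng.integers(0, s - d + 1)               # 0-based start site (text uses 1-based)
    seg = np.empty(s, dtype=int)
    for k in range(s):
        probs = pm[k - j] if j <= k < j + d else p0
        seg[k] = rng.choice(4, p=probs)
    return seg

Y = np.array([simulate_segment() for _ in range(u)])

def g(Y, p0, pm):
    """Observed-data likelihood: average over the s-d+1 equally likely start sites."""
    total = 1.0
    for seg in Y:
        inner = 0.0
        for j in range(s - d + 1):
            term = 1.0
            for k in range(s):
                term *= pm[k - j, seg[k]] if j <= k < j + d else p0[seg[k]]
            inner += term
        total *= inner / (s - d + 1)
    return total
```

The inner sum over j marginalizes out the unknown start site, exactly as in the displayed formula for g(Y | p).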
We identify the missing data with a matrix $Z = (z_{ij})$ of indicator random variables. Entry $z_{ij}$ indicates whether the binding domain in segment i begins at site j. With this understanding, the loglikelihood of the complete data $X = (Y, Z)$ reduces to
\[
\ln f(X \mid p) = \sum_{i=1}^{u} \sum_{j=1}^{s-d+1} z_{ij} \left[ \sum_{k \in I_j} \ln p(k-j+1, y_{ik}) + \sum_{k \notin I_j} \ln p(0, y_{ik}) \right].
\]
Note here that only one of the $z_{ij}$ is nonzero for each i. The E step of the EM algorithm evaluates the conditional expectation

\[
E(z_{ij} \mid Y, p_n) = \frac{\prod_{k \in I_j} p_n(k-j+1, y_{ik}) \prod_{k \notin I_j} p_n(0, y_{ik})}{\sum_{m=1}^{s-d+1} \prod_{k \in I_m} p_n(k-m+1, y_{ik}) \prod_{k \notin I_m} p_n(0, y_{ik})}
\]
                              using Bayes’ rule. Adapting the reasoning of the ABO example, it is easy
                              to demonstrate that the M step gives
\begin{align*}
p_{n+1}(0, b) &= \frac{1}{u(s-d)} \sum_{i=1}^{u} \sum_{j=1}^{s-d+1} E(z_{ij} \mid Y, p_n) \sum_{k \notin I_j} 1_{\{y_{ik} = b\}} \\
p_{n+1}(m, b) &= \frac{1}{u} \sum_{i=1}^{u} \sum_{j=1}^{s-d+1} E(z_{ij} \mid Y, p_n) \, 1_{\{y_{i,j+m-1} = b\}}.
\end{align*}
                              On convergence, not only does the EM algorithm supply the background
                              probabilities p(0,b) and the domain probabilities p(m, b), but it also yields
for each site the posterior probability that the site initiates a binding domain.
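The E and M steps above can be combined into a single update. The following is a sketch, assuming Y is a u-by-s integer matrix of bases encoded 0–3 and (p0, pm) hold the current background and domain probabilities; the function and variable names are hypothetical, not from the text.

```python
import numpy as np

def em_step(Y, p0, pm):
    """One EM iteration for the binding-domain model; returns updated (p0, pm, w)."""
    u, s = Y.shape
    d = pm.shape[0]
    J = s - d + 1                                # number of possible start sites

    # E step: posterior weight w[i, j] = E(z_ij | Y, p_n) via Bayes' rule.
    w = np.empty((u, J))
    for i in range(u):
        for j in range(J):
            lik = 1.0
            for k in range(s):
                lik *= pm[k - j, Y[i, k]] if j <= k < j + d else p0[Y[i, k]]
            w[i, j] = lik
        w[i] /= w[i].sum()                       # normalize over start sites

    # M step: reestimate the background probabilities p(0, b) ...
    new_p0 = np.zeros(4)
    for i in range(u):
        for j in range(J):
            outside = np.concatenate([Y[i, :j], Y[i, j + d:]])
            for b in range(4):
                new_p0[b] += w[i, j] * np.count_nonzero(outside == b)
    new_p0 /= u * (s - d)

    # ... and the domain probabilities p(m, b) for m = 1, ..., d.
    new_pm = np.zeros((d, 4))
    for i in range(u):
        for j in range(J):
            for m in range(d):
                new_pm[m, Y[i, j + m]] += w[i, j]
    new_pm /= u

    return new_p0, new_pm, w
```

Iterating em_step to convergence produces the probability estimates, and the final rows of w give the per-site posterior start probabilities mentioned above.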
                              2.8 Problems
   1. At some autosomal locus with two alleles R and r, let R be dominant to r. Suppose a random sample of n people contains $n_r$ people with the recessive genotype r/r. Prove that $\sqrt{n_r/n}$ is the maximum likelihood estimate of the frequency of allele r under Hardy-Weinberg equilibrium.