
experiment [5, 9, 13]. Some function t(X) = Y collapses X onto Y. For instance, if we represent X as (Y, Z), with Z as the missing data, then t is simply projection onto the Y-component of X. It should be stressed that the missing data can consist of more than just observations missing in the ordinary sense. In fact, the definition of X is left up to the intuition and cleverness of the statistician. The general idea is to choose X so that maximum likelihood becomes trivial for the complete data.
The complete data are assumed to have a probability density f(X | θ) that is a function of a parameter vector θ as well as of X. In the E step of the EM algorithm, we calculate the conditional expectation

    Q(θ | θ_n) = E[ln f(X | θ) | Y, θ_n].

Here θ_n is the current estimated value of θ. In the M step, we maximize Q(θ | θ_n) with respect to θ. This yields the new parameter estimate θ_{n+1}, and we repeat this two-step process until convergence occurs. Note that θ and θ_n play fundamentally different roles in Q(θ | θ_n).
The essence of the EM algorithm is that maximizing Q(θ | θ_n) leads to an increase in the loglikelihood ln g(Y | θ) of the observed data. This assertion is proved in the following theoretical section, which can be omitted by readers interested primarily in practical applications of the EM algorithm.


2.4 Ascent Property of the EM Algorithm

The entropy (or information) inequality at the heart of the EM algorithm is a consequence of Jensen's inequality, which relates convex functions to expectations. Recall that a twice-differentiable function h(w) is convex on an interval (a, b) if and only if h''(w) ≥ 0 for all w in (a, b). If the defining inequality is strict, then h(w) is said to be strictly convex.
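As a quick check of the second-derivative criterion (an illustrative aside, not part of the original text), consider two standard examples:

    h(w) = e^w:    h''(w) = e^w > 0 for all w, so e^w is strictly convex on (−∞, ∞);
    h(w) = −ln w:  h''(w) = 1/w² > 0 for all w > 0, so −ln w is strictly convex on (0, ∞).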
Proposition 2.4.1 (Jensen's Inequality) Let W be a random variable with values confined to the possibly infinite interval (a, b). If E denotes expectation and h(w) is convex on (a, b), then E[h(W)] ≥ h[E(W)]. For a strictly convex function, equality holds in Jensen's inequality if and only if W = E(W) almost surely.
Proof: Put u = E(W). For w in (a, b), we have

    h(w) = h(u) + h'(u)(w − u) + h''(v)(w − u)²/2
         ≥ h(u) + h'(u)(w − u)

for some v between u and w. Note that v is in (a, b). Now substitute the random variable W for the point w and take expectations. It follows that

    E[h(W)] ≥ h(u) + h'(u)[E(W) − u]
            = h(u).
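As an informal numerical illustration of the proposition (a sketch assuming only that Monte Carlo averages approximate the two expectations), the following Python snippet checks E[h(W)] ≥ h[E(W)] for the strictly convex function h(w) = e^w and a standard normal W, for which E[h(W)] = e^(1/2) while h[E(W)] = e^0 = 1.

import numpy as np

rng = np.random.default_rng(1)
w = rng.standard_normal(200_000)

lhs = np.exp(w).mean()       # estimates E[h(W)], roughly exp(0.5) = 1.6487
rhs = np.exp(w.mean())       # estimates h[E(W)], roughly exp(0) = 1
print(lhs, rhs, lhs >= rhs)  # Jensen's inequality predicts lhs >= rhs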