experiment [5, 9, 13]. Some function t(X) = Y collapses X onto Y. For
instance, if we represent X as (Y, Z), with Z as the missing data, then
t is simply projection onto the Y-component of X. It should be stressed
that the missing data can consist of more than just observations missing
in the ordinary sense. In fact, the definition of X is left up to the intuition
and cleverness of the statistician. The general idea is to choose X so that
maximum likelihood becomes trivial for the complete data.
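As a concrete illustration (ours, not drawn from the text), suppose Y_1, ..., Y_m are sampled from the two-component normal mixture π φ(y − μ_1) + (1 − π) φ(y − μ_2) with unit variances, where φ is the standard normal density. Taking X = (Y, Z), with Z_i indicating the component of origin of Y_i, gives the complete-data loglikelihood

ln f(X | θ) = Σ_i { Z_i ln[π φ(Y_i − μ_1)] + (1 − Z_i) ln[(1 − π) φ(Y_i − μ_2)] },

which is maximized trivially: the sample proportion of Z_i = 1 estimates π, and the within-component averages estimate μ_1 and μ_2.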
The complete data are assumed to have a probability density f(X | θ)
that is a function of a parameter vector θ as well as of X. In the E step of
the EM algorithm, we calculate the conditional expectation
Q(θ | θ_n) = E[ln f(X | θ) | Y, θ_n].
Here θ_n is the current estimated value of θ. In the M step, we maximize
Q(θ | θ_n) with respect to θ. This yields the new parameter estimate θ_{n+1},
and we repeat this two-step process until convergence occurs. Note that θ
and θ_n play fundamentally different roles in Q(θ | θ_n).
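To make the two steps concrete, here is a minimal sketch in Python for the mixture illustration above; the unit-variance assumption, the starting values, and all variable names are ours, not the text's. The E step computes the weights w_i = E[Z_i | Y, θ_n], and the M step maximizes Q(θ | θ_n) by weighted averages. The observed-data loglikelihood is tracked at each iteration; the next paragraph and Section 2.4 explain why it never decreases.

    import numpy as np

    rng = np.random.default_rng(0)
    y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 100)])

    def phi(y, mu):
        # standard normal density shifted to mean mu (unit variance assumed)
        return np.exp(-0.5 * (y - mu) ** 2) / np.sqrt(2.0 * np.pi)

    def loglik(y, pi, mu1, mu2):
        # observed-data loglikelihood ln g(Y | theta)
        return np.sum(np.log(pi * phi(y, mu1) + (1.0 - pi) * phi(y, mu2)))

    pi, mu1, mu2 = 0.5, -1.0, 1.0       # starting values (our choice)
    prev = loglik(y, pi, mu1, mu2)
    for n in range(500):
        # E step: w_i = E[Z_i | Y, theta_n], posterior weight of component 1
        num = pi * phi(y, mu1)
        w = num / (num + (1.0 - pi) * phi(y, mu2))
        # M step: weighted averages maximize Q(theta | theta_n)
        pi = w.mean()
        mu1 = np.sum(w * y) / np.sum(w)
        mu2 = np.sum((1.0 - w) * y) / np.sum(1.0 - w)
        cur = loglik(y, pi, mu1, mu2)
        assert cur >= prev - 1e-10      # ascent property (Section 2.4)
        if cur - prev < 1e-8:           # declare convergence
            break
        prev = cur
    print(pi, mu1, mu2)                 # estimates near 0.6, -2, and 3

Running this sketch shows the loglikelihood increasing monotonically from iteration to iteration, which is exactly the ascent property asserted next.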
The essence of the EM algorithm is that maximizing Q(θ | θ_n) leads to an
increase in the loglikelihood ln g(Y | θ) of the observed data. This assertion
is proved in the following theoretical section, which can be omitted by
readers interested primarily in practical applications of the EM algorithm.
2.4 Ascent Property of the EM Algorithm
The entropy (or information) inequality at the heart of the EM algorithm
is a consequence of Jensen’s inequality, which relates convex functions to
expectations. Recall that a twice-differentiable function h(w) is convex on
an interval (a, b) if and only if h''(w) ≥ 0 for all w in (a, b). If the defining
inequality is strict, then h(w) is said to be strictly convex. For example,
h(w) = e^w is strictly convex on the whole real line because h''(w) = e^w > 0.
Proposition 2.4.1 (Jensen’s Inequality) Let W be a random variable with
values confined to the possibly infinite interval (a, b). If E denotes ex-
pectation and h(w) is convex on (a, b), then E[h(W)] ≥ h[E(W)]. For a
strictly convex function, equality holds in Jensen's inequality if and only if
W = E(W) almost surely.
Proof: Put u = E(W). For w in (a, b), we have

h(w) = h(u) + h'(u)(w − u) + h''(v)(w − u)²/2
     ≥ h(u) + h'(u)(w − u)
for some v between u and w. Note that v is in (a, b). Now substitute the
random variable W for the point w and take expectations. It follows that
E[h(W)] ≥ h(u) + h'(u)[E(W) − u]
        = h(u).
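As a quick numerical sanity check (our addition; the choice h(w) = w² and the exponential sample are illustrative assumptions, not from the text), a Monte Carlo experiment confirms the inequality:

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.exponential(scale=2.0, size=100_000)   # E(W) = 2, E(W^2) = 8
    lhs = np.mean(W ** 2)     # Monte Carlo estimate of E[h(W)] with h(w) = w^2
    rhs = np.mean(W) ** 2     # h[E(W)]
    print(lhs, rhs)           # roughly 8 versus 4; E[h(W)] > h[E(W)] as predicted

Since h(w) = w² is strictly convex and W is not almost surely constant, the inequality is strict here, in agreement with the equality condition of the proposition.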