Page 42 - Applied Probability

2. Counting Methods and the EM Algorithm
If h(w) is strictly convex, then the neglected term h''(v)(w − u)^2/2 is
positive whenever w ≠ u.
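The positivity of this remainder can be checked numerically. The sketch below is a hypothetical illustration (not from the text) using h(w) = −ln w, which is strictly convex on (0, ∞); the function `remainder` computes the first-order Taylor remainder h(w) − h(u) − h'(u)(w − u), which equals h''(v)(w − u)^2/2 for some v between u and w.

```python
import math

# Illustration with h(w) = -ln(w), strictly convex on (0, inf).
# Here h'(u) = -1/u, and the first-order Taylor remainder
# h(w) - h(u) - h'(u)*(w - u) equals h''(v)*(w - u)**2 / 2 for some
# v between u and w; it is positive whenever w != u.
def remainder(u, w):
    h = lambda t: -math.log(t)
    h_prime = lambda t: -1.0 / t
    return h(w) - h(u) - h_prime(u) * (w - u)

for u, w in [(1.0, 2.0), (3.0, 0.5), (2.0, 2.0)]:
    print(u, w, remainder(u, w))  # positive except when w == u
```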
                                The next proposition uses the language of measure theory. To assist
                              readers unfamiliar with the subject, it is fair to point out that in practical
                              applications most measures reduce to ordinary length, area, or volume for
                              continuous random variables and to counting for discrete random variables.
In the former case, we integrate, and in the latter case, we sum. The term
                              “almost everywhere” is the counterpart of “almost surely” in probability
                              theory and can be interpreted as everywhere with little harm.
Proposition 2.4.2 (Entropy Inequality) Let f and g be probability densities
with respect to a measure µ. Suppose f > 0 and g > 0 almost everywhere
relative to µ. If E_f denotes expectation with respect to the probability
measure f dµ, then E_f(ln f) ≥ E_f(ln g), with equality only if f = g almost
everywhere relative to µ.
                              Proof: Because −ln(w) is a strictly convex function on (0, ∞), Jensen’s
                              inequality applied to the random variable g/f implies
E_f(ln f) − E_f(ln g) = E_f(−ln(g/f))
                      ≥ −ln E_f(g/f)
                      = −ln ∫ (g/f) f dµ
                      = −ln ∫ g dµ
                      = 0.

Equality holds only if g/f = E_f(g/f) almost everywhere relative to µ.
However, E_f(g/f) = 1.
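In the discrete case, where µ is counting measure and expectations reduce to sums, the entropy inequality can be verified directly. The sketch below uses hypothetical probability vectors f and g with all entries positive; `entropy_gap` computes E_f(ln f) − E_f(ln g).

```python
import math

# Discrete entropy inequality: with counting measure, the expectation
# E_f(ln f) - E_f(ln g) becomes the sum over i of f_i * ln(f_i / g_i),
# which is nonnegative and zero only when f = g.
def entropy_gap(f, g):
    return sum(fi * math.log(fi / gi) for fi, gi in zip(f, g))

f = [0.5, 0.3, 0.2]  # hypothetical probability vectors
g = [0.2, 0.5, 0.3]
print(entropy_gap(f, g))  # positive, since f differs from g
print(entropy_gap(f, f))  # exactly 0 when g = f
```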
  Reverting to the notation Q(θ | θ_n) = E[ln f(X | θ) | Y = y, θ_n] of the
EM algorithm, we next prove that

        Q(θ_n | θ_n) − ln g(y | θ_n) ≥ Q(θ | θ_n) − ln g(y | θ)

for all θ and θ_n, where g(y | θ) is the likelihood of the observed data
Y = y. To this end, note that both f(x | θ)/g(y | θ) and f(x | θ_n)/g(y | θ_n)
are conditional densities of X on {x : t(x) = y} with respect to some measure
µ_y. The entropy inequality now indicates that

        Q(θ | θ_n) − ln g(y | θ) = E[ln (f(X | θ)/g(y | θ)) | Y = y, θ_n]
                                 ≤ E[ln (f(X | θ_n)/g(y | θ_n)) | Y = y, θ_n]
                                 = Q(θ_n | θ_n) − ln g(y | θ_n).
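This inequality is what guarantees the ascent property of EM: each iteration can only increase the observed log-likelihood ln g(y | θ_n). The sketch below is a hedged illustration on a hypothetical model not taken from the text, a two-component normal mixture with unit variances, and checks numerically that the log-likelihood never decreases across iterations.

```python
import math
import random

# Hypothetical two-component normal mixture (known unit variances,
# unknown means m1, m2 and mixing weight p) used to illustrate the
# ascent property of EM proved above.
def norm_pdf(x, m):
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2 * math.pi)

def log_lik(y, p, m1, m2):
    # Observed-data log-likelihood ln g(y | theta).
    return sum(math.log(p * norm_pdf(x, m1) + (1 - p) * norm_pdf(x, m2))
               for x in y)

def em_step(y, p, m1, m2):
    # E step: posterior probability that each observation belongs to
    # component 1 under the current parameters.
    w = [p * norm_pdf(x, m1) /
         (p * norm_pdf(x, m1) + (1 - p) * norm_pdf(x, m2)) for x in y]
    # M step: update the mixing weight and the two weighted means.
    s = sum(w)
    p_new = s / len(y)
    m1_new = sum(wi * x for wi, x in zip(w, y)) / s
    m2_new = sum((1 - wi) * x for wi, x in zip(w, y)) / (len(y) - s)
    return p_new, m1_new, m2_new

random.seed(0)
y = [random.gauss(-2, 1) for _ in range(100)] + \
    [random.gauss(3, 1) for _ in range(100)]
p, m1, m2 = 0.5, -1.0, 1.0
lik = [log_lik(y, p, m1, m2)]
for _ in range(25):
    p, m1, m2 = em_step(y, p, m1, m2)
    lik.append(log_lik(y, p, m1, m2))
# The ascent inequality holds at every iteration (up to roundoff).
print(all(b >= a - 1e-10 for a, b in zip(lik, lik[1:])))
```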