

10.5.2 Mixture Models and Hidden Variables
Each of the previous examples is an instance of a general form of model, known as a mixture model, where a data item is generated by first choosing a mixture component (the line or the outlier; which segment the pixel comes from), then generating the data item from that component. Call the parameters of the l-th component θ_l, the probability of choosing the l-th component π_l, and write Θ = (π_1, ..., π_g, θ_1, ..., θ_g). Then, we can write the probability of generating x as

    p(x \mid \Theta) = \sum_{j} p(x \mid \theta_j)\, \pi_j .
This is a weighted sum, or mixture, of probability models; the π_l are usually called mixing weights. One can visualize this model as a density in the space of x that consists of a set of g “blobs” of probability, each of which is associated with a component of the model. We want to determine: (a) the parameters of each of these blobs, (b) the mixing weights, and usually (c) from which component each token came. The log-likelihood of the data for a general mixture model is

    L(\Theta) = \sum_{i \in \text{observations}} \log \left( \sum_{j=1}^{g} \pi_j\, p_j(x_i \mid \theta_j) \right).
This function is hard to maximize because of the sum inside the logarithm. As in the last two examples, the problem would be simplified if we knew the mixture component from which each token came, because then we could estimate the components independently.
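To make the difficulty concrete, here is a minimal sketch that evaluates L(Θ) for a one-dimensional Gaussian mixture with hand-picked parameters; the component form, the numbers, and all names below are illustrative assumptions rather than anything fixed by the model above.

    import numpy as np

    def gaussian_pdf(x, mu, sigma):
        """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    def mixture_log_likelihood(x, pis, mus, sigmas):
        """L(Theta) = sum_i log( sum_j pi_j p_j(x_i | theta_j) )."""
        total = 0.0
        for xi in x:
            # The sum over components sits inside the logarithm, so the
            # contribution of x_i cannot be split into per-component terms.
            inner = sum(pi * gaussian_pdf(xi, mu, s)
                        for pi, mu, s in zip(pis, mus, sigmas))
            total += np.log(inner)
        return total

    # Toy usage with two components.
    x = np.array([-2.1, -1.9, 1.8, 2.0, 2.2])
    print(mixture_log_likelihood(x, pis=[0.4, 0.6], mus=[-2.0, 2.0], sigmas=[0.5, 0.5]))

Because every π_j and θ_j appears inside each logarithm, differentiating with respect to one component's parameters still involves all the others, which is what makes direct maximization awkward.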
We now introduce a new set of variables. For each data item, we have a vector of indicator variables (one per component) that tells us from which component each data item came. We write δ_i for the vector associated with the i-th data item, and δ_ij for the j-th component of δ_i. Then, we have

    \delta_{ij} = \begin{cases} 1 & \text{if item } i \text{ came from component } j \\ 0 & \text{otherwise} \end{cases}
and these variables are unknown. If we did know these variables, we could maximize the complete data log-likelihood,

    L_c(\Theta) = \sum_{i \in \text{observations}} \log P(x_i, \delta_i \mid \Theta),
which would be quite easy to do (because it would boil down to estimating the components independently; the sketch below illustrates this decoupling). We regard δ as part of our data that happens to be missing (which is why we call this the complete data log-likelihood).
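As a minimal sketch of this decoupling, assume one-dimensional Gaussian components and known indicator vectors δ_i; the parameter values and names below are illustrative assumptions. Because exactly one δ_ij equals 1 for each item, log P(x_i, δ_i | Θ) reduces to log π_j + log p(x_i | θ_j) for that single j, so the sum over observations splits into independent per-component pieces.

    import numpy as np

    def gaussian_log_pdf(x, mu, sigma):
        """Log-density of a 1-D Gaussian with mean mu and standard deviation sigma."""
        return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

    def complete_data_log_likelihood(x, delta, pis, mus, sigmas):
        """L_c(Theta) = sum_i log P(x_i, delta_i | Theta), with delta known."""
        total = 0.0
        for xi, di in zip(x, delta):
            j = int(np.argmax(di))  # the single component with delta_ij = 1
            # Each term touches only component j's parameters.
            total += np.log(pis[j]) + gaussian_log_pdf(xi, mus[j], sigmas[j])
        return total

    # Toy usage: each observation is labelled with its generating component.
    x = np.array([-2.1, -1.9, 1.8, 2.0, 2.2])
    delta = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
    print(complete_data_log_likelihood(x, delta,
                                       pis=[0.4, 0.6], mus=[-2.0, 2.0], sigmas=[0.5, 0.5]))

Maximizing L_c then amounts to fitting each component to the observations assigned to it, and setting each π_j from the assignment counts.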
The form of L_c(Θ) for mixture models is worth remembering because it involves a neat trick: