10.5.2 Mixture Models and Hidden Variables
Each of the previous examples is an instance of a general form of model, known as a mixture model, where a data item is generated by first choosing a mixture component (the line or the outlier; which segment the pixel comes from), then generating the data item from that component. Call the parameters for the $l$th component $\theta_l$, the probability of choosing the $l$th component $\pi_l$, and write $\Theta = (\pi_1, \ldots, \pi_g, \theta_1, \ldots, \theta_g)$, where $g$ is the number of components. Then, we can write the probability of generating $x$ as
$$p(x\,|\,\Theta) = \sum_{j=1}^{g} p(x\,|\,\theta_j)\,\pi_j.$$
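To make the generative story concrete, here is a minimal sketch in Python, assuming a one-dimensional Gaussian mixture with two components; the parameter values (and the names pis, mus, sigmas) are illustrative choices, not anything from the text. Sampling follows the two-step recipe exactly: pick a component $j$ with probability $\pi_j$, then draw $x$ from $p(x\,|\,\theta_j)$; the density routine evaluates the weighted sum above.

```python
import numpy as np

# A minimal sketch of the generative process, assuming a 1-D Gaussian
# mixture with two components. The parameter values (and the names
# pis, mus, sigmas) are illustrative, not from the text.
rng = np.random.default_rng(0)

pis = np.array([0.6, 0.4])     # mixing weights pi_j (sum to 1)
mus = np.array([0.0, 5.0])     # theta_j = (mu_j, sigma_j) for each component
sigmas = np.array([1.0, 0.5])

def sample(n):
    """First choose a component j with probability pi_j, then draw x from p(x|theta_j)."""
    j = rng.choice(len(pis), size=n, p=pis)   # step 1: pick mixture components
    return rng.normal(mus[j], sigmas[j])      # step 2: generate from those components

def mixture_density(x):
    """p(x|Theta) = sum_j pi_j p(x|theta_j): a weighted sum of Gaussian 'blobs'."""
    comp = np.exp(-0.5 * ((x[:, None] - mus) / sigmas) ** 2) \
           / (sigmas * np.sqrt(2.0 * np.pi))
    return comp @ pis

x = sample(1000)
print(mixture_density(x[:5]))
```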
This is a weighted sum, or mixture, of probability models; the $\pi_l$ are usually called mixing weights. One can visualize this model as a density in the space of $x$ that consists of a set of $g$ "blobs" of probability, each of which is associated with a component of the model. We want to determine: (a) the parameters of each of these blobs, (b) the mixing weights, and usually (c) from which component each token came. The log-likelihood of the data for a general mixture model is
$$\mathcal{L}(\Theta) = \sum_{i\in\text{observations}} \log\left(\sum_{j=1}^{g} \pi_j\, p_j(x_i\,|\,\theta_j)\right).$$
This function is hard to maximize, because of the sum inside the logarithm. Just as in the last two examples, the problem would be simplified if we knew the mixture component from which each token came, because then we could estimate the components independently.
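As a sketch of why the sum inside the logarithm is awkward, the following snippet (again assuming a hypothetical one-dimensional Gaussian mixture with made-up parameters) evaluates $\mathcal{L}(\Theta)$: the per-component terms must be combined inside the log, here via a numerically stable log-sum-exp, so the log does not split into per-component pieces that could be maximized separately.

```python
import numpy as np
from scipy.special import logsumexp

# Sketch: evaluating L(Theta) for a hypothetical 1-D Gaussian mixture.
# The sum over components sits *inside* the logarithm, so the log does
# not split into per-component terms that could be maximized separately.
pis = np.array([0.6, 0.4])     # illustrative parameters, as before
mus = np.array([0.0, 5.0])
sigmas = np.array([1.0, 0.5])

def log_likelihood(x):
    # log p_j(x_i|theta_j): an (n, g) array of per-component log densities
    log_comp = (-0.5 * ((x[:, None] - mus) / sigmas) ** 2
                - np.log(sigmas * np.sqrt(2.0 * np.pi)))
    # log sum_j pi_j p_j(x_i|theta_j), computed stably, then summed over i
    return logsumexp(log_comp + np.log(pis), axis=1).sum()

x = np.array([-0.3, 0.1, 4.8, 5.2])
print(log_likelihood(x))
```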
We now introduce a new set of variables. For each data item, we have a vector
of indicator variables (one per component) that tells us from which component each
data item came. We write $\delta_i$ for the vector associated with the $i$th data item, and $\delta_{ij}$ for the $j$th component of $\delta_i$. Then, we have
$$\delta_{ij} = \begin{cases} 1 & \text{if item $i$ came from component $j$} \\ 0 & \text{otherwise,} \end{cases}$$
and these variables are unknown. If we did know these variables, we could maximize
the complete data log-likelihood,
$$\mathcal{L}_c(\Theta) = \sum_{i\in\text{observations}} \log P(x_i, \delta_i\,|\,\Theta),$$
which would be quite easy to do (because it would boil down to estimating the components independently; a sketch of this decomposition follows below). We regard $\delta$ as part of our data that happens to be missing (which is why we call this the complete data log-likelihood).
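The sketch below illustrates the point, again for a hypothetical one-dimensional Gaussian mixture with made-up data and assignments: when $\delta$ is known, the maximum-likelihood estimates decompose, with each component fit only from the items assigned to it and the mixing weights given by the assignment fractions.

```python
import numpy as np

# Sketch: if the indicator variables delta were known, maximizing the
# complete-data log-likelihood would decompose into independent
# per-component estimates. Illustrated for a hypothetical 1-D Gaussian
# mixture; the data and assignments below are made up.
x = np.array([-0.3, 0.1, 0.4, 4.8, 5.2])
delta = np.array([[1, 0],
                  [1, 0],
                  [1, 0],
                  [0, 1],
                  [0, 1]])   # delta[i, j] = 1 iff item i came from component j

n_j = delta.sum(axis=0)      # number of items assigned to each component
pis = n_j / len(x)           # ML mixing weights: fraction of items per component
mus = (delta.T @ x) / n_j    # ML mean of each component, from its own items only
vars_j = (delta * (x[:, None] - mus) ** 2).sum(axis=0) / n_j
sigmas = np.sqrt(vars_j)     # ML standard deviation per component

print(pis, mus, sigmas)
```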
The form of $\mathcal{L}_c(\Theta)$ for mixture models is worth remembering because it involves a neat trick: