Page 280 -

P. 280

5.3 Mean shift and mode ﬁnding 259

Comaniciu and Meer (2002) prove that this algorithm converges to a local maximum of f(x)
under reasonably weak conditions on the kernel k(r), i.e., that it is monotonically decreasing.
This convergence is not guaranteed for regular gradient descent unless appropriate step size
control is used.
The two kernels that Comaniciu and Meer (2002) studied are the Epanechnikov kernel,
k E (r) = max(0, 1 − r), (5.40)

which is a radial generalization of a bilinear kernel, and the Gaussian (normal) kernel,
1

k N (r) = exp − r . (5.41)
2
The corresponding derivative kernels g(r) are a unit ball and another Gaussian, respectively.
Using the Epanechnikov kernel converges in a ﬁnite number of steps, while the Gaussian
kernel has a smoother trajectory (and produces better results), but converges very slowly near
a mode (Exercise 5.5).
The simplest way to apply mean shift is to start a separate mean-shift mode estimate
y at every input point x i and to iterate for a ﬁxed number of steps or until the mean-shift
magnitude is below a threshold. A faster approach is to randomly subsample the input points
x i and to keep track of each point’s temporal evolution. The remaining points can then be
classiﬁed based on the nearest evolution path (Comaniciu and Meer 2002). Paris and Durand
(2007) review a number of other more efﬁcient implementations of mean shift, including their
own approach, which is based on using an efﬁcient low-resolution estimate of the complete
multi-dimensional space of f(x) along with its stationary points.
The color-based segmentation shown in Figure 5.16 only looks at pixel colors when deter-
mining the best clustering. It may therefore cluster together small isolated pixels that happen
to have the same color, which may not correspond to a semantically meaningful segmentation
of the image.
Better results can usually be obtained by clustering in the joint domain of color and lo-
cation. In this approach, the spatial coordinates of the image x s =(x, y), which are called
the spatial domain, are concatenated with the color values x r , which are known as the range
domain, and mean-shift clustering is applied in this ﬁve-dimensional space x j . Since location
and color may have different scales, the kernels are adjusted accordingly, i.e., we use a kernel
of the form
2 2

x r x s
K(x j )= k k , (5.42)
h 2 r h 2 s
where separate parameters h s and h r are used to control the spatial and range bandwidths of
the ﬁlter kernels. Figure 5.18 shows an example of mean-shift clustering in the joint domain,
with parameters (h s ,h r ,M) = (16, 19, 40), where spatial regions containing less than M
pixels are eliminated.
The form of the joint domain ﬁlter kernel (5.42) is reminiscent of the bilateral ﬁlter kernel
(3.34–3.37) discussed in Section 3.3.1. The difference between mean shift and bilateral ﬁl-
tering, however, is that in mean shift the spatial coordinates of each pixel are adjusted along
with its color values, so that the pixel migrates more quickly towards other pixels with similar
colors, and can therefore later be used for clustering and segmentation.
Determining the best bandwidth parameters h to use with mean shift remains something
of an art, although a number of approaches have been explored. These include optimizing

275 276 277 278 279 280 281 282 283 284 285