Page 280 -
P. 280
5.3 Mean shift and mode finding 259
Comaniciu and Meer (2002) prove that this algorithm converges to a local maximum of f(x)
under reasonably weak conditions on the kernel k(r), i.e., that it is monotonically decreasing.
This convergence is not guaranteed for regular gradient descent unless appropriate step size
control is used.
The two kernels that Comaniciu and Meer (2002) studied are the Epanechnikov kernel,
k E (r) = max(0, 1 − r), (5.40)
which is a radial generalization of a bilinear kernel, and the Gaussian (normal) kernel,
1
k N (r) = exp − r . (5.41)
2
The corresponding derivative kernels g(r) are a unit ball and another Gaussian, respectively.
Using the Epanechnikov kernel converges in a finite number of steps, while the Gaussian
kernel has a smoother trajectory (and produces better results), but converges very slowly near
a mode (Exercise 5.5).
The simplest way to apply mean shift is to start a separate mean-shift mode estimate
y at every input point x i and to iterate for a fixed number of steps or until the mean-shift
magnitude is below a threshold. A faster approach is to randomly subsample the input points
x i and to keep track of each point’s temporal evolution. The remaining points can then be
classified based on the nearest evolution path (Comaniciu and Meer 2002). Paris and Durand
(2007) review a number of other more efficient implementations of mean shift, including their
own approach, which is based on using an efficient low-resolution estimate of the complete
multi-dimensional space of f(x) along with its stationary points.
The color-based segmentation shown in Figure 5.16 only looks at pixel colors when deter-
mining the best clustering. It may therefore cluster together small isolated pixels that happen
to have the same color, which may not correspond to a semantically meaningful segmentation
of the image.
Better results can usually be obtained by clustering in the joint domain of color and lo-
cation. In this approach, the spatial coordinates of the image x s =(x, y), which are called
the spatial domain, are concatenated with the color values x r , which are known as the range
domain, and mean-shift clustering is applied in this five-dimensional space x j . Since location
and color may have different scales, the kernels are adjusted accordingly, i.e., we use a kernel
of the form
2 2
x r x s
K(x j )= k k , (5.42)
h 2 r h 2 s
where separate parameters h s and h r are used to control the spatial and range bandwidths of
the filter kernels. Figure 5.18 shows an example of mean-shift clustering in the joint domain,
with parameters (h s ,h r ,M) = (16, 19, 40), where spatial regions containing less than M
pixels are eliminated.
The form of the joint domain filter kernel (5.42) is reminiscent of the bilateral filter kernel
(3.34–3.37) discussed in Section 3.3.1. The difference between mean shift and bilateral fil-
tering, however, is that in mean shift the spatial coordinates of each pixel are adjusted along
with its color values, so that the pixel migrates more quickly towards other pixels with similar
colors, and can therefore later be used for clustering and segmentation.
Determining the best bandwidth parameters h to use with mean shift remains something
of an art, although a number of approaches have been explored. These include optimizing