Figure 4.16 Feature matching: how can we extract local descriptors that are invariant to inter-image variations
and yet still discriminative enough to establish correct correspondences?
and motion. Tuytelaars and Van Gool (2004) use affine invariant regions to detect correspondences for wide baseline stereo matching, whereas Kadir, Zisserman, and Brady (2004) detect salient regions where patch entropy and its rate of change with scale are locally maximal. Corso and Hager (2005) use a related technique to fit 2D oriented Gaussian kernels to homogeneous regions. More details on techniques for finding and matching curves, lines, and regions can be found later in this chapter.
4.1.2 Feature descriptors
After detecting features (keypoints), we must match them, i.e., we must determine which features come from corresponding locations in different images. In some situations, e.g., for video sequences (Shi and Tomasi 1994) or for stereo pairs that have been rectified (Zhang, Deriche, Faugeras et al. 1995; Loop and Zhang 1999; Scharstein and Szeliski 2002), the local motion around each feature point may be mostly translational. In this case, simple error metrics, such as the sum of squared differences or normalized cross-correlation described in Section 8.1, can be used to directly compare the intensities in small patches around each feature point. (The comparative study by Mikolajczyk and Schmid (2005), discussed below, uses cross-correlation.) Because feature points may not be exactly located, a more accurate matching score can be computed by performing incremental motion refinement as described in Section 8.1.3, but this can be time consuming and can sometimes even decrease performance (Brown, Szeliski, and Winder 2005).
In most cases, however, the local appearance of features will change in orientation and
scale, and sometimes even undergo affine deformations. Extracting a local scale, orientation,
or affine frame estimate and then using this to resample the patch before forming the feature
descriptor is thus usually preferable (Figure 4.17).
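A minimal sketch of this resampling step, assuming a grayscale image stored as a NumPy array and a keypoint whose position (x, y), scale, and orientation angle have already been estimated by the detector (the function and parameter names here are illustrative, not from any particular library):

    import numpy as np
    from scipy import ndimage

    def canonical_patch(image, x, y, scale, angle, size=8, spacing=1.0):
        # Sample a size x size patch around (x, y) in a rotated, scaled
        # frame, so the descriptor is always formed in a canonical pose.
        c, s = np.cos(angle), np.sin(angle)
        step = spacing * scale  # image pixels per descriptor pixel
        # Descriptor-frame coordinates, centered on the keypoint.
        r = (np.arange(size) - (size - 1) / 2.0) * step
        u, v = np.meshgrid(r, r)
        # Rotate the sampling grid by the keypoint orientation, which
        # effectively derotates the patch into the canonical frame.
        xs = x + c * u - s * v
        ys = y + s * u + c * v
        # Bilinear interpolation (order=1) at the rotated sample points.
        return ndimage.map_coordinates(image, [ys, xs], order=1,
                                       mode="nearest")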
Even after compensating for these changes, the local appearance of image patches will usually still vary from image to image. How can we make image descriptors more invariant to such changes, while still preserving discriminability between different (non-corresponding) patches (Figure 4.16)? Mikolajczyk and Schmid (2005) review some recently developed view-invariant local image descriptors and experimentally compare their performance. Below, we describe a few of these descriptors in more detail.
Bias and gain normalization (MOPS). For tasks that do not exhibit large amounts of foreshortening, such as image stitching, simple normalized intensity patches perform reasonably well and are simple to implement (Brown, Szeliski, and Winder 2005) (Figure 4.17). In