Figure 4.28 Feature tracking using an affine motion model (Shi and Tomasi 1994) © 1994 IEEE. Top row: image
patch around the tracked feature location. Bottom row: image patch after warping back toward the first frame
using an affine deformation. Even though the speed sign gets larger from frame to frame, the affine transformation
maintains a good resemblance between the original and subsequent tracked frames.
2002; Williams, Blake, and Cipolla 2003). These topics are all covered in more detail in
Section 8.1.3.
If features are being tracked over longer image sequences, their appearance can undergo
larger changes. You then have to decide whether to continue matching against the originally
detected patch (feature) or to re-sample each subsequent frame at the matching location. The
former strategy is prone to failure as the original patch can undergo appearance changes such
as foreshortening. The latter runs the risk of the feature drifting from its original location
to some other location in the image (Shi and Tomasi 1994). (Mathematically, small mis-
registration errors compound to create a Markov Random Walk, which leads to larger drift
over time.)
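To make the drift argument concrete, here is a minimal Monte Carlo sketch in Python (the frame count, per-frame error, and number of trials are illustrative assumptions, not values from the text). Because re-sampling the template every frame means each frame contributes an independent localization error, the accumulated position behaves like a random walk whose RMS drift grows in proportion to the square root of the number of frames.

```python
import numpy as np

# Minimal sketch: re-sampling the template every frame turns small,
# independent localization errors into a random walk of the feature position.
# All numbers below are illustrative assumptions.
rng = np.random.default_rng(0)

n_frames = 500      # length of the hypothetical sequence
sigma = 0.1         # assumed per-frame localization error (pixels)
n_trials = 2000     # Monte Carlo trials

# Each frame adds an independent 2D error to the tracked position.
steps = rng.normal(0.0, sigma, size=(n_trials, n_frames, 2))
final_pos = steps.cumsum(axis=1)[:, -1, :]          # position after the last frame
drift = np.linalg.norm(final_pos, axis=1)           # distance from the true location

rms_drift = np.sqrt((drift ** 2).mean())
print(f"RMS drift after {n_frames} frames:      {rms_drift:.2f} px")
print(f"Random-walk prediction sigma*sqrt(2N): {sigma * np.sqrt(2 * n_frames):.2f} px")
```

The two printed numbers agree closely, showing that even a 0.1-pixel per-frame error compounds to several pixels of drift over a few hundred frames.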
A preferable solution is to compare the original patch to later image locations using an
affine motion model (Section 8.2). Shi and Tomasi (1994) first compare patches in neigh-
boring frames using a translational model and then use the location estimates produced by
this step to initialize an affine registration between the patch in the current frame and the
base frame where a feature was first detected (Figure 4.28). In their system, features are only
detected infrequently, i.e., only in regions where tracking has failed. In the usual case, an
area around the current predicted location of the feature is searched with an incremental reg-
istration algorithm (Section 8.1.3). The resulting tracker is often called the Kanade–Lucas–
Tomasi (KLT) tracker.
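As a concrete illustration of this style of pipeline, the following Python sketch uses OpenCV: cv2.goodFeaturesToTrack for the initial (Shi–Tomasi) feature detection, cv2.calcOpticalFlowPyrLK for the translational frame-to-frame step, and cv2.findTransformECC with an affine motion model as a stand-in for the affine registration against the base frame described above. The input file name, patch size, correlation threshold, and re-detection rule are assumptions made for the sketch, not parameters of the original system.

```python
import cv2
import numpy as np

PATCH = 15          # half-width of the verification patch (assumed)
ECC_THRESH = 0.7    # minimum correlation against the base-frame patch (assumed)

def get_patch(gray, pt, r=PATCH):
    """Extract a (2r+1)x(2r+1) float32 patch around pt, or None near the border."""
    x, y = int(round(pt[0])), int(round(pt[1]))
    if x - r < 0 or y - r < 0 or x + r + 1 > gray.shape[1] or y + r + 1 > gray.shape[0]:
        return None
    return gray[y - r:y + r + 1, x - r:x + r + 1].astype(np.float32)

cap = cv2.VideoCapture("video.mp4")        # hypothetical input sequence
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detect "good features to track" (Shi-Tomasi criterion) in the base frame
# and remember the patch around each one for later affine verification.
pts = cv2.goodFeaturesToTrack(prev, maxCorners=200, qualityLevel=0.01, minDistance=10)
base_patches = [get_patch(prev, p.ravel()) for p in pts]

criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 50, 1e-4)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # 1. Translational, frame-to-frame step (pyramidal Lucas-Kanade).
    new_pts, status, err = cv2.calcOpticalFlowPyrLK(
        prev, gray, pts, None, winSize=(21, 21), maxLevel=3)

    kept_pts, kept_patches = [], []
    for p, st, base in zip(new_pts, status.ravel(), base_patches):
        patch = get_patch(gray, p.ravel()) if st else None
        if patch is None or base is None:
            continue
        # 2. Affine registration of the current patch against the base-frame
        #    patch; drop the track if the alignment quality is too low.
        warp = np.eye(2, 3, dtype=np.float32)
        try:
            cc, warp = cv2.findTransformECC(
                base, patch, warp, cv2.MOTION_AFFINE, criteria, None, 5)
        except cv2.error:
            continue
        if cc >= ECC_THRESH:
            kept_pts.append(p)
            kept_patches.append(base)

    pts = np.array(kept_pts, dtype=np.float32) if kept_pts else None
    base_patches = kept_patches
    prev = gray

    # 3. Re-detect features only when tracking has failed for too many of them.
    if pts is None or len(pts) < 50:
        pts = cv2.goodFeaturesToTrack(prev, maxCorners=200,
                                      qualityLevel=0.01, minDistance=10)
        if pts is None:
            break  # no features found; give up in this sketch
        base_patches = [get_patch(prev, p.ravel()) for p in pts]
```

Note that features are re-detected only when too few tracks survive, mirroring the infrequent-detection strategy described above, and that each surviving track is checked against the patch stored at the frame where the feature was first detected rather than against the previous frame.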
Since their original work on feature tracking, Shi and Tomasi’s approach has generated a
string of interesting follow-on papers and applications. Beardsley, Torr, and Zisserman (1996)
use extended feature tracking combined with structure from motion (Chapter 7) to incremen-
tally build up sparse 3D models from video sequences. Kang, Szeliski, and Shum (1997)
tie together the corners of adjacent (regularly gridded) patches to provide some additional
stability to the tracking, at the cost of poorer handling of occlusions. Tommasini, Fusiello,
Trucco et al. (1998) provide a better spurious match rejection criterion for the basic Shi and
Tomasi algorithm, Collins and Liu (2003) provide improved mechanisms for feature selec-
tion and dealing with larger appearance changes over time, and Shafique and Shah (2005)
develop algorithms for feature matching (data association) for videos with large numbers of