Page 305 -
P. 305
284 6 Feature-based alignment
6.2 Pose estimation
A particular instance of feature-based alignment, which occurs very often, is estimating an
object’s 3D pose from a set of 2D point projections. This pose estimation problem is also
known as extrinsic calibration, as opposed to the intrinsic calibration of internal camera pa-
rameters such as focal length, which we discuss in Section 6.3. The problem of recovering
pose from three correspondences, which is the minimal amount of information necessary,
is known as the perspective-3-point-problem (P3P), with extensions to larger numbers of
points collectively known as PnP (Haralick, Lee, Ottenberg et al. 1994; Quan and Lan 1999;
Moreno-Noguer, Lepetit, and Fua 2007).
In this section, we look at some of the techniques that have been developed to solve such
problems, starting with the direct linear transform (DLT), which recovers a 3×4 camera ma-
trix, followed by other “linear” algorithms, and then looking at statistically optimal iterative
algorithms.
6.2.1 Linear algorithms
The simplest way to recover the pose of the camera is to form a set of linear equations analo-
gous to those used for 2D motion estimation (6.19) from the camera matrix form of perspec-
tive projection (2.55–2.56),
p 00 X i + p 01 Y i + p 02 Z i + p 03
= (6.33)
x i
p 20 X i + p 21 Y i + p 22 Z i + p 23
p 10 X i + p 11 Y i + p 12 Z i + p 13
= , (6.34)
y i
p 20 X i + p 21 Y i + p 22 Z i + p 23
where (x i ,y i ) are the measured 2D feature locations and (X i ,Y i ,Z i ) are the known 3D
feature locations (Figure 6.4). As with (6.21), this system of equations can be solved in a
linear fashion for the unknowns in the camera matrix P by multiplying the denominator on
both sides of the equation. 9 The resulting algorithm is called the direct linear transform
(DLT) and is commonly attributed to Sutherland (1974). (For a more in-depth discussion,
refer to the work of Hartley and Zisserman (2004).) In order to compute the 12 (or 11)
unknowns in P , at least six correspondences between 3D and 2D locations must be known.
As with the case of estimating homographies (6.21–6.23), more accurate results for the
entries in P can be obtained by directly minimizing the set of Equations (6.33–6.34) using
non-linear least squares with a small number of iterations.
Once the entries in P have been recovered, it is possible to recover both the intrinsic
calibration matrix K and the rigid transformation (R, t) by observing from Equation (2.56)
that
P = K[R|t]. (6.35)
Since K is by convention upper-triangular (see the discussion in Section 2.1.5), both K and
R can be obtained from the front 3 × 3 sub-matrix of P using RQ factorization (Golub and
Van Loan 1996). 10
9
Because P is unknown up to a scale, we can either fix one of the entries, e.g., p 23 =1, or find the smallest
singular vector of the set of linear equations.
10 Note the unfortunate clash of terminologies: In matrix algebra textbooks, R represents an upper-triangular
matrix; in computer vision, R is an orthogonal rotation.