Page 305 -
P. 305

284                                                                6 Feature-based alignment


                                6.2 Pose estimation

                                A particular instance of feature-based alignment, which occurs very often, is estimating an
                                object’s 3D pose from a set of 2D point projections. This pose estimation problem is also
                                known as extrinsic calibration, as opposed to the intrinsic calibration of internal camera pa-
                                rameters such as focal length, which we discuss in Section 6.3. The problem of recovering
                                pose from three correspondences, which is the minimal amount of information necessary,
                                is known as the perspective-3-point-problem (P3P), with extensions to larger numbers of
                                points collectively known as PnP (Haralick, Lee, Ottenberg et al. 1994; Quan and Lan 1999;
                                Moreno-Noguer, Lepetit, and Fua 2007).
                                   In this section, we look at some of the techniques that have been developed to solve such
                                problems, starting with the direct linear transform (DLT), which recovers a 3×4 camera ma-
                                trix, followed by other “linear” algorithms, and then looking at statistically optimal iterative
                                algorithms.


                                6.2.1 Linear algorithms
                                The simplest way to recover the pose of the camera is to form a set of linear equations analo-
                                gous to those used for 2D motion estimation (6.19) from the camera matrix form of perspec-
                                tive projection (2.55–2.56),

                                                            p 00 X i + p 01 Y i + p 02 Z i + p 03
                                                        =                                            (6.33)
                                                    x i
                                                            p 20 X i + p 21 Y i + p 22 Z i + p 23
                                                            p 10 X i + p 11 Y i + p 12 Z i + p 13
                                                        =                           ,                (6.34)
                                                    y i
                                                            p 20 X i + p 21 Y i + p 22 Z i + p 23
                                where (x i ,y i ) are the measured 2D feature locations and (X i ,Y i ,Z i ) are the known 3D
                                feature locations (Figure 6.4). As with (6.21), this system of equations can be solved in a
                                linear fashion for the unknowns in the camera matrix P by multiplying the denominator on
                                both sides of the equation. 9  The resulting algorithm is called the direct linear transform
                                (DLT) and is commonly attributed to Sutherland (1974). (For a more in-depth discussion,
                                refer to the work of Hartley and Zisserman (2004).) In order to compute the 12 (or 11)
                                unknowns in P , at least six correspondences between 3D and 2D locations must be known.
                                   As with the case of estimating homographies (6.21–6.23), more accurate results for the
                                entries in P can be obtained by directly minimizing the set of Equations (6.33–6.34) using
                                non-linear least squares with a small number of iterations.
                                   Once the entries in P have been recovered, it is possible to recover both the intrinsic
                                calibration matrix K and the rigid transformation (R, t) by observing from Equation (2.56)
                                that
                                                               P = K[R|t].                           (6.35)
                                Since K is by convention upper-triangular (see the discussion in Section 2.1.5), both K and
                                R can be obtained from the front 3 × 3 sub-matrix of P using RQ factorization (Golub and
                                Van Loan 1996). 10
                                  9
                                   Because P is unknown up to a scale, we can either fix one of the entries, e.g., p 23 =1, or find the smallest
                                singular vector of the set of linear equations.
                                 10  Note the unfortunate clash of terminologies: In matrix algebra textbooks, R represents an upper-triangular
                                matrix; in computer vision, R is an orthogonal rotation.
   300   301   302   303   304   305   306   307   308   309   310