                                orientation) can be estimated using simple least squares (Section 6.2.1). Under orthography,
                                structure and motion can simultaneously be estimated using factorization (singular value de-
                                composition), as discussed in Section 7.3 (Tomasi and Kanade 1992).
                                   A closely related projection model is para-perspective (Aloimonos 1990; Poelman and
                                Kanade 1997). In this model, object points are again first projected onto a local reference
                                parallel to the image plane. However, rather than being projected orthogonally to this plane,
                                they are projected parallel to the line of sight to the object center (Figure 2.7d). This is
                                followed by the usual projection onto the final image plane, which again amounts to a scaling.
                                The combination of these two projections is therefore affine and can be written as

                                    \tilde{x} = \begin{bmatrix} a_{00} & a_{01} & a_{02} & a_{03} \\ a_{10} & a_{11} & a_{12} & a_{13} \\ 0 & 0 & 0 & 1 \end{bmatrix} \tilde{p}.                    (2.49)

                                Note how parallel lines in 3D remain parallel after projection in Figure 2.7b–d. Para-perspective
                                provides a more accurate projection model than scaled orthography, without incurring the
                                added complexity of per-pixel perspective division, which invalidates traditional factoriza-
                                tion methods (Poelman and Kanade 1997).
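                                To make the affine form of (2.49) concrete, here is a minimal Python/NumPy sketch (the
                                particular entries a_ij are made-up values, not taken from the text) that applies a generic
                                affine camera to homogeneous 3D points and checks numerically that parallel 3D lines remain
                                parallel after projection:

                                    import numpy as np

                                    # A generic 3x4 affine camera as in Equation (2.49); the entries a_ij
                                    # are arbitrary illustrative values, not a calibrated camera.
                                    A = np.array([[0.8, 0.1, 0.05, 2.0],
                                                  [0.0, 0.9, 0.10, 1.0],
                                                  [0.0, 0.0, 0.00, 1.0]])

                                    def project_affine(A, p):
                                        """Apply x~ = A p~ to a 3D point p = (x, y, z)."""
                                        x_h = A @ np.append(p, 1.0)   # last entry of x_h stays 1
                                        return x_h[:2]

                                    # Two parallel 3D lines (common direction d) project to parallel 2D lines.
                                    d = np.array([1.0, 2.0, 3.0])
                                    p0 = np.array([0.0, 0.0, 5.0])
                                    p1 = np.array([1.0, -1.0, 6.0])
                                    v0 = project_affine(A, p0 + d) - project_affine(A, p0)
                                    v1 = project_affine(A, p1 + d) - project_affine(A, p1)
                                    print(v0[0] * v1[1] - v0[1] * v1[0])   # 2D cross product ~ 0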

                                Perspective
                                The most commonly used projection in computer graphics and computer vision is true 3D
                                perspective (Figure 2.7e). Here, points are projected onto the image plane by dividing them
                                by their z component. Using inhomogeneous coordinates, this can be written as
                                    \bar{x} = \mathcal{P}_z(\mathbf{p}) = \begin{bmatrix} x/z \\ y/z \\ 1 \end{bmatrix}.                    (2.50)

                                In homogeneous coordinates, the projection has a simple linear form,

                                    \tilde{x} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{p},                    (2.51)
                                i.e., we drop the w component of p. Thus, after projection, it is not possible to recover the
                                distance of the 3D point from the image, which makes sense for a 2D imaging sensor.
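                                As a small illustration of (2.50)-(2.51), the following Python/NumPy sketch (the sample point
                                is arbitrary) forms the homogeneous projection, which drops the w component, and then performs
                                the division by z:

                                    import numpy as np

                                    # Linear part of the perspective projection, Equation (2.51):
                                    # applying P to p~ = (x, y, z, w) simply drops the w component.
                                    P = np.array([[1.0, 0.0, 0.0, 0.0],
                                                  [0.0, 1.0, 0.0, 0.0],
                                                  [0.0, 0.0, 1.0, 0.0]])

                                    def project_perspective(p):
                                        """Project a 3D point p = (x, y, z) as in Equation (2.50)."""
                                        x_h = P @ np.append(p, 1.0)   # homogeneous image point (x, y, z)
                                        return x_h / x_h[2]           # divide by z -> (x/z, y/z, 1)

                                    # The depth z = 4 cannot be recovered from the result, as noted above.
                                    print(project_perspective(np.array([2.0, 4.0, 4.0])))   # [0.5 1.  1. ]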
                                   A form often seen in computer graphics systems is a two-step projection that first projects
                                3D coordinates into normalized device coordinates in the range (x, y, z) ∈ [−1, 1] ×
                                [−1, 1] × [0, 1], and then rescales these coordinates to integer pixel coordinates using a view-
                                port transformation (Watt 1995; OpenGL-ARB 1997).  The (initial) perspective projection
                                is then represented using a 4 × 4 matrix

                                    \tilde{x} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -z_{near}/z_{range} & z_{near} z_{far}/z_{range} \\ 0 & 0 & 1 & 0 \end{bmatrix} \tilde{p},                    (2.52)
                                where $z_{near}$ and $z_{far}$ are the near and far $z$ clipping planes and $z_{range} = z_{far} - z_{near}$. Note
                                that the first two rows are actually scaled by the focal length and the aspect ratio so that
                                visible rays are mapped into the normalized (x, y) range given above.
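                                As a sketch of how (2.52) behaves, the Python/NumPy fragment below builds the 4 × 4 matrix for
                                arbitrary example values of z_near and z_far and applies it, with the perspective division, to
                                points on the two clipping planes; the resulting depth values simply reflect the matrix as
                                written under the positive-z convention of (2.50):

                                    import numpy as np

                                    def projection_matrix(z_near, z_far):
                                        """The 4x4 perspective matrix of Equation (2.52)."""
                                        z_range = z_far - z_near
                                        return np.array([[1.0, 0.0, 0.0, 0.0],
                                                         [0.0, 1.0, 0.0, 0.0],
                                                         [0.0, 0.0, -z_far / z_range, z_near * z_far / z_range],
                                                         [0.0, 0.0, 1.0, 0.0]])

                                    def to_ndc(M, p):
                                        """Apply M to a 3D point and perform the perspective (w) division."""
                                        x_h = M @ np.append(p, 1.0)
                                        return x_h[:3] / x_h[3]

                                    # Arbitrary example clipping planes.
                                    M = projection_matrix(z_near=1.0, z_far=100.0)
                                    print(to_ndc(M, np.array([0.0, 0.0, 1.0])))     # near plane -> depth 0
                                    print(to_ndc(M, np.array([0.0, 0.0, 100.0])))   # far plane -> depth -1 here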