
Section 12.2 Model-based Vision: Registering Rigid Objects with Projection


                            orientation, and scale of a known object in an image with respect to the camera,
                            despite some uncertainty about which image features lie on the object. Such algo-
                            rithms can be extremely useful in systems that must interact with the world. For
example, if we wished to move an object into a particular position or grasp it,
knowing its configuration with respect to the camera would be a great help. We use
                            the same strategy for this problem that we used for registering 3D objects to 3D
                            objects, that is, repeatedly: find a group; recover the transformation; apply this to
                            the whole source; and score the similarity between the source and the target. At
                            the end, we report the transformation with the best score. Furthermore, if the best
                            available transformation score is good, then the object is there; if it is bad, then it
                            isn’t.
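
To make the loop concrete, here is a minimal sketch in Python. The callables
sample_group, estimate_transform, apply_transform, and similarity_score are
hypothetical stand-ins for the four steps just listed, and the trial count and
acceptance threshold are tuning parameters we have assumed for illustration.

    import numpy as np

    def register(source, target, sample_group, estimate_transform,
                 apply_transform, similarity_score,
                 n_trials=1000, accept_threshold=0.5):
        # Hypothesize-and-test registration (sketch).  The four callables
        # stand in for the steps in the text; they are not a fixed API.
        best_transform, best_score = None, -np.inf
        for _ in range(n_trials):
            group = sample_group(source, target)            # find a group
            transform = estimate_transform(group)           # recover the transformation
            projected = apply_transform(transform, source)  # apply to the whole source
            score = similarity_score(projected, target)     # score source against target
            if score > best_score:
                best_transform, best_score = transform, score
        # A good best score suggests the object is present; a bad one,
        # that it is not.
        return best_transform, best_score, best_score > accept_threshold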
                                 The source S now consists of tokens on some geometric structure, and T is
                            the image (in one or another kind of camera) of another set of tokens on a rotated,
                            translated, and scaled version of that structure. We would like to determine the
                            rotation, translation, and scale applied. Usually this problem involves a significant
                            number of outliers in T , which occur because we don’t know which image features
                            actually came from the object. Almost always the tokens are points or lines; for
S, these are determined from a geometric model of the object, and for T, these
come from edge points or from lines fitted to edge points (we could use the machinery
                            of Chapter 10 to get these lines). This case has two distinctive features. We might
                            not be able to estimate all transform parameters (which typically won’t matter all
                            that much), and it can be quite difficult to come up with a satisfactory score of
                            similarity between the source and the target.
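
One natural score for point tokens is the fraction of projected source points
that land near some target point. Here is a sketch, assuming a user-chosen
pixel distance threshold; a function like this could serve as the
similarity_score callable in the loop above.

    import numpy as np
    from scipy.spatial import cKDTree

    def similarity_score(projected, target, dist_threshold=2.0):
        # projected: (m, 2) projected source points; target: (n, 2) image points.
        # Returns the fraction of projected points lying within dist_threshold
        # pixels of their nearest target point.
        d, _ = cKDTree(target).query(projected)
        return float(np.mean(d < dist_threshold))

This score is crude: among other things, it lets several source points claim
the same target point, which is one illustration of why satisfactory scores
are hard to construct.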
                                 There are numerous ways of estimating transform parameters. The details
                            depend on whether we need to calibrate the camera, and on what camera model we
                            impose. In the simplest case, assume we have an orthographic camera, calibrated
                            up to unknown camera scale, looking down the z axis in the camera coordinate
                            system. Then we cannot determine depth to the 3D object, because changing the
                            depth does not change the image. We cannot determine the scale of the object
                            separate from the scale of the camera, because by changing these two parameters
                            together we can fix the image. For example, if we double the size of the object,
                            and also halve the size of the camera units, then the image points will have the
                            same coordinate values. However, this doesn’t affect the reasoning behind the
                            search processes described above. For example, if we build the right correspondence
                            between source and target group, then visible source tokens should end up close to
                            or on top of target tokens. This means that a RANSAC-style approach applies, as
                            above. Similarly, if we represent the transformation parameters appropriately (we
                            could set the camera scale arbitrarily to one), we could vote.
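
As a sketch of the voting alternative, assume each candidate group yields a
vector of the recoverable parameters (with the camera scale fixed at one); we
quantize these vectors, accumulate votes, and report the cell with the most
support. The interface below is ours, chosen purely for illustration.

    import numpy as np
    from collections import Counter

    def vote(param_vectors, bin_widths):
        # param_vectors: one recovered parameter vector per candidate group,
        # e.g., (rotation angles, observable translation, object scale).
        # bin_widths: quantization step for each parameter.
        acc = Counter()
        for p in param_vectors:
            acc[tuple(np.round(np.asarray(p) / bin_widths).astype(int))] += 1
        key, votes = acc.most_common(1)[0]
        # Return the center of the winning cell and its vote count.
        return np.asarray(key) * np.asarray(bin_widths), votes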
                                 In the case of a single orthographic camera, calibrated up to unknown camera
                            scale, correspondences between three points are enough to estimate rotation, the
                            two observable components of translation, and scale (see the exercises, which also
                            give other frame groups). In most applications, the range of depths across the
                            object is small compared to the depth to the object. In turn, this means that a
                            perspective camera can be approximated with the weak perspective approximation
                            of Section 1.1.2. This is equivalent to a single orthographic camera, calibrated up
                            to unknown camera scale. If the scale of the camera is known, then it is possible
                            to recover depth to the object as well.
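
The exact three-point computation is left to the exercises. As a sketch of the
same estimate from four or more point correspondences, we can fit a general
affine projection by least squares and then use an SVD to find the nearest
scaled orthographic factor; the function below assumes this setup and is not a
standard library routine.

    import numpy as np

    def fit_weak_perspective(X, x):
        # X: (n, 3) model points; x: (n, 2) image points; n >= 4.
        # Model: x_i ~ s * R2 @ X_i + t, where R2 is the top two rows of a
        # rotation matrix (orthonormal rows), s a scale, t a 2-vector.
        Xc, xc = X - X.mean(axis=0), x - x.mean(axis=0)
        A = np.linalg.lstsq(Xc, xc, rcond=None)[0].T   # least-squares 2x3 affine fit
        U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
        R2 = U @ Vt            # nearest matrix with orthonormal rows
        s = sigma.mean()       # scale: average of the two singular values
        t = x.mean(axis=0) - s * (R2 @ X.mean(axis=0))
        return s, R2, t

Notice that depth does not appear in the output, consistent with the ambiguity
described above; if the camera scale were known, depth could be recovered
from s.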