Section 12.2 Model-based Vision: Registering Rigid Objects with Projection
orientation, and scale of a known object in an image with respect to the camera,
despite some uncertainty about which image features lie on the object. Such algo-
rithms can be extremely useful in systems that must interact with the world. For
example, if we wished to move an object into a particular position or grasp it, it
could be really useful to know its configuration with respect to the camera. We use
the same strategy for this problem that we used for registering 3D objects to 3D
objects, that is, repeatedly: find a group; recover the transformation; apply this to
the whole source; and score the similarity between the source and the target. At
the end, we report the transformation with the best score. Furthermore, if the best
available transformation score is good, then the object is there; if it is bad, then it
isn’t.
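To make the search loop concrete, here is a minimal sketch in Python/NumPy, under stated assumptions: the function name and the estimate_pose and project callables are placeholders supplied by the caller for whatever camera model is in use (for instance, scaled orthography), groups are paired at random in RANSAC style, and visibility of source tokens is ignored.

import numpy as np

def register_by_hypothesize_and_test(source_pts, target_pts, estimate_pose,
                                     project, group_size=3, n_iters=500,
                                     inlier_dist=2.0, rng=None):
    """Repeatedly: find a group, recover the transformation, apply it to the
    whole source, and score the match against the target.

    source_pts: (N, 3) model tokens; target_pts: (M, 2) image tokens.
    estimate_pose(src_group, tgt_group) -> pose or None, and
    project(pose, pts) -> (N, 2), are supplied by the caller for whatever
    camera model is in use (these names are placeholders, not a fixed API).
    """
    rng = np.random.default_rng() if rng is None else rng
    best_pose, best_score = None, -1
    for _ in range(n_iters):
        # Find a group: a minimal set of source and target tokens, paired at
        # random here (RANSAC-style); smarter group selection is possible.
        src = source_pts[rng.choice(len(source_pts), group_size, replace=False)]
        tgt = target_pts[rng.choice(len(target_pts), group_size, replace=False)]
        # Recover the transformation from the group.
        pose = estimate_pose(src, tgt)
        if pose is None:        # degenerate or inconsistent group
            continue
        # Apply it to the whole source, then score the similarity: count
        # projected source tokens that land close to some target token.
        proj = project(pose, source_pts)
        dists = np.linalg.norm(proj[:, None, :] - target_pts[None, :, :], axis=2)
        score = int(np.sum(dists.min(axis=1) < inlier_dist))
        if score > best_score:
            best_pose, best_score = pose, score
    # A good best_score suggests the object is present; a poor one that it isn't.
    return best_pose, best_score

Passing the pose estimator and the projector in as arguments keeps the loop itself independent of the camera model, which is why the same strategy serves for registering 3D objects to 3D objects and for registering them to images.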
The source S now consists of tokens on some geometric structure, and T is
the image (in one or another kind of camera) of another set of tokens on a rotated,
translated, and scaled version of that structure. We would like to determine the
rotation, translation, and scale applied. Usually this problem involves a significant
number of outliers in T , which occur because we don’t know which image features
actually came from the object. Almost always the tokens are points or lines; for
S, these are determined from a geometric model of the object, and for T, they
come from edge points or from lines fitted to edge points (we could use the machinery
of Chapter 10 to get these lines). This case has two distinctive features. We might
not be able to estimate all transform parameters (which typically won’t matter all
that much), and it can be quite difficult to come up with a satisfactory score of
similarity between the source and the target.
There are numerous ways of estimating transform parameters. The details
depend on whether we need to calibrate the camera, and on what camera model we
impose. In the simplest case, assume we have an orthographic camera, calibrated
up to unknown camera scale, looking down the z axis in the camera coordinate
system. Then we cannot determine depth to the 3D object, because changing the
depth does not change the image. We cannot determine the scale of the object
separately from the scale of the camera, because changing these two parameters
together can leave the image unchanged. For example, if we double the size of the object
and halve the camera scale, the image points will have the
same coordinate values. However, this doesn’t affect the reasoning behind the
search processes described above. For example, if we build the right correspondence
between source and target group, then visible source tokens should end up close to
or on top of target tokens. This means that a RANSAC-style approach applies, as
above. Similarly, if we represent the transformation parameters appropriately (we
could set the camera scale arbitrarily to one), we could vote.
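Both ambiguities are easy to check numerically. The sketch below (Python/NumPy; the point values are made up purely for illustration) projects a toy set of camera-frame points with an orthographic camera of scale s, and confirms that neither a pure change of depth nor a coordinated change of object size and camera scale changes the image.

import numpy as np

def orthographic_image(points_cam, s):
    # An orthographic camera of scale s looking down the camera z axis:
    # a camera-frame point (X, Y, Z) maps to s * (X, Y).
    return s * points_cam[:, :2]

# A toy set of object points, already in camera coordinates (arbitrary values).
X = np.array([[0.0, 0.0, 10.0],
              [1.0, 0.0, 10.5],
              [0.0, 2.0, 11.0]])
s = 2.0
base = orthographic_image(X, s)

# Translating the object in depth does not change the image ...
shifted = orthographic_image(X + np.array([0.0, 0.0, 100.0]), s)

# ... and doubling all object coordinates while halving the camera scale
# does not change it either.
rescaled = orthographic_image(2.0 * X, s / 2.0)

print(np.allclose(base, shifted), np.allclose(base, rescaled))   # True True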
In the case of a single orthographic camera, calibrated up to unknown camera
scale, correspondences between three points are enough to estimate rotation, the
two observable components of translation, and scale (see the exercises, which also
give other frame groups). In most applications, the range of depths across the
object is small compared to the depth to the object. In turn, this means that a
perspective camera can be approximated with the weak perspective approximation
of Section 1.1.2. This is equivalent to a single orthographic camera, calibrated up
to unknown camera scale. If the scale of the camera is known, then it is possible
to recover depth to the object as well.
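As a quick numeric check of both points (a Python/NumPy sketch with made-up values, not a prescribed procedure): when the range of depths across the object is small compared to the depth to the object, the weak perspective image, which is a scaled orthographic camera with scale s = f/Z0, stays close to the full perspective image; and if the camera scale f is known, the recovered image scale gives the depth as Z0 = f/s.

import numpy as np

f = 1000.0    # camera scale (e.g., focal length in pixels), assumed known
Z0 = 50.0     # depth to the object's reference point

# Toy object points in camera coordinates; the depth range (about one unit)
# is small compared to the depth Z0.
X = np.array([[ 0.2, -0.1, Z0 - 0.5],
              [ 0.4,  0.3, Z0 + 0.2],
              [-0.3,  0.1, Z0 + 0.6]])

# Full perspective projection: x = f * (X / Z, Y / Z).
persp = f * X[:, :2] / X[:, 2:3]

# Weak perspective replaces each point's depth by the reference depth Z0,
# which is a scaled orthographic camera with scale s = f / Z0.
s = f / Z0
weak = s * X[:, :2]

print(np.abs(persp - weak).max())   # well under a pixel for these values

# If f is known and s has been recovered by registration, the depth to the
# object follows as Z0 = f / s.
print(f / s)                        # 50.0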

