Therefore, the general task of real-time vision is to achieve a compact internal representation of the motion processes of several objects observed in parallel by evaluating feature flows in the image sequence. Since egomotion also enters the image content, the state of the vehicle carrying the cameras has to be observed simultaneously. However, vision yields information only on relative motion between objects; moreover, it does so with appreciable time delay (several tenths of a second) and with no immediate reference to inertial space. Therefore, conventional sensors on the body that yield relative motion with respect to the stationary environment (like odometers), or inertial accelerations and rotational rates (from inertial sensors like accelerometers and angular rate sensors), are very valuable for perceiving egomotion and for telling it apart from the visual effects of the motion of other objects. Inertial sensors have the additional advantage of picking up perturbation effects from the environment before they show up as unexpected deviations in the integrals (speed components and pose changes). All these measurements, with their differing delay times and confidence levels, have to be interpreted jointly to arrive at a consistent interpretation of the situation for making decisions on appropriate behavior.
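As a minimal illustration of this joint interpretation, the following Python sketch (all names and the simple one-dimensional model are assumptions, not from the text) keeps a short history of inertially propagated states so that a vision measurement arriving several tenths of a second late can be evaluated against the state valid at its capture time:

```python
import numpy as np

DT_IMU = 0.01        # inertial update period [s] (assumed)
DELAY_VISION = 0.3   # assumed vision latency, several tenths of a second

class StateBuffer:
    """Short history of states, so a delayed vision measurement can be
    compared against the state valid at its capture time."""
    def __init__(self):
        self.history = []  # list of (time, state) pairs

    def push(self, t, state):
        self.history.append((t, np.array(state)))

    def state_at(self, t):
        # return the stored state closest in time to t
        return min(self.history, key=lambda ts: abs(ts[0] - t))[1]

buf = StateBuffer()
state = np.array([0.0, 1.0])   # [position, velocity]
for k in range(100):           # inertial dead reckoning at 100 Hz
    accel = 0.0                # measured acceleration; perturbations show up here first
    state = state + DT_IMU * np.array([state[1], accel])
    buf.push((k + 1) * DT_IMU, state)

# A vision measurement of position arrives now, but was captured
# DELAY_VISION seconds ago; its residual refers to the capture time.
t_now = 100 * DT_IMU
z_vision = 0.72
residual = z_vision - buf.state_at(t_now - DELAY_VISION)[0]
```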
Before this can be achieved, perceptual and behavioral capabilities have to be defined and represented (Chapters 3 to 6). Road recognition while driving, as indicated in Figures 2.7 and 2.9, will be the application area in Chapters 7 to 10. The approach is similar to the human one: Driven by the optical input from the image sequence, an internal animation process in 3-D space and time is started with members of generically known object and subject classes that are to duplicate the visual appearance of “the world” by prediction-error feedback. For the next measurement time (corrected for time-delay effects), the expected values in each measurement modality are predicted. The prediction errors are then used to improve the internal state representation, taking into account the Jacobian matrices and the confidence in the models for both the motion processes and the measurement processes involved (error covariance matrices).
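One cycle of this prediction-error feedback can be written in the familiar form of an extended Kalman filter. The following sketch is illustrative only; the constant-velocity model, the position-only measurement, and all numerical values are assumptions:

```python
import numpy as np

def predict(x, P, F, Q):
    """Predict state and error covariance to the next measurement time."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, h, H, R):
    """Correct the state estimate with the prediction error z - h(x)."""
    innovation = z - h(x)              # prediction error
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # gain, weighing model vs. measurement
    x_new = x + K @ innovation         # improved internal state representation
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

# Illustrative constant-velocity model with a position-only measurement.
dt = 0.04                              # one video cycle [s] (assumed)
F = np.array([[1.0, dt], [0.0, 1.0]])  # transition Jacobian
Q = np.diag([1e-4, 1e-3])              # confidence in the motion model
H = np.array([[1.0, 0.0]])             # measurement Jacobian
R = np.array([[1e-2]])                 # confidence in the measurement process
x, P = np.array([0.0, 1.0]), np.eye(2)

x, P = predict(x, P, F, Q)
x, P = update(x, P, np.array([0.05]), lambda s: H @ s, H, R)
```

Here Q expresses the confidence in the dynamic model and R the confidence in the measurement process; the gain K weighs the prediction error accordingly.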
For vision, the concatenation process with HCTs for each object-sensor pair (Figure 2.7), as part of the physical world, provides the means for achieving our goal of understanding dynamic processes in an integrated approach. Since the analysis of the next image of a sequence should take advantage of all information collected up to that time, temporal prediction is performed based on the current best estimates available for all objects involved and on the dynamic models as discussed. Note that no storage of image data is required in this approach; only the parameters and state variables of the instantiated objects need to be stored to represent the observed scene. Usually, this reduces storage requirements by several orders of magnitude.
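Concatenation with HCTs amounts to multiplying 4 × 4 homogeneous transformation matrices along the chain from object to sensor. The sketch below is illustrative; the frame names, the restriction to yaw rotations, and the numerical values are assumptions:

```python
import numpy as np

def hct(yaw, tx, ty, tz):
    """4x4 homogeneous coordinate transformation: rotation about the
    vertical axis plus translation. In vision (unlike computer graphics),
    yaw and the translation entries are typically unknowns to be estimated."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    T[:3, 3] = (tx, ty, tz)
    return T

# Edges of the scene tree (Figure 2.7), each one an HCT (values assumed):
cam_from_vehicle = hct(0.02, 1.5, 0.0, 1.2)    # camera mounting on platform
vehicle_from_road = hct(-0.1, 0.0, -1.8, 0.0)  # egovehicle pose in lane
road_from_other = hct(0.0, 40.0, 3.5, 0.0)     # other vehicle along the road

# Concatenation maps a point on the other vehicle into camera coordinates.
cam_from_other = cam_from_vehicle @ vehicle_from_road @ road_from_other
point_other = np.array([0.0, 0.0, 0.5, 1.0])   # homogeneous point
point_cam = cam_from_other @ point_other
```

Keeping only the few parameters of each HCT and each object's state variables, rather than the images themselves, is what yields the storage reduction mentioned above.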
Figure 2.9 showed a road scene with one vehicle on a curved road (upper right)
in the viewing range of the egovehicle (left); the connecting object is, in general, the curved road with several lanes. The mounting conditions for the camera in the
vehicle (lower left) on a platform are shown in an exploded view on top for clarity.
The coordinate systems define the different locations and aspect conditions for ob-
ject mapping. The trouble in vision (as opposed to computer graphics) is that the
entries in most of the HCT matrices are the unknowns of the vision problem (rela-
tive distances and angles). In a tree representation of this arrangement of objects
(Figure 2.7), each edge between circles represents an HCT and each node (circle)