Page 72 - Dynamic Vision for Perception and Control of Motion
56 2 Basic Relations: Image Sequences – “the World”
represents an object or sub-object as a movable or functionally separate part. Objects may be inserted into or deleted from the tree from one frame to the next (dynamic scene tree).
This scene tree represents the mapping process from features on the surfaces of objects in the real world, up to hundreds of meters away, into the image of one or more cameras. There, these features finally have an extension of several pixels on the camera chip (a few dozen micrometers with today's technology). Their motion on the chip is to be interpreted as motion, in the real world, of the body carrying these features, taking the effect of body motion on the mapping process properly into account. Since body motions are, in general, smooth, spatiotemporal embedding and first-order approximations help make visual interpretation more efficient, especially at high image rates as in video sequences.
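The dynamic scene tree can be sketched as a small data structure. The following is a minimal illustration (all class names, object labels, and distances are hypothetical, chosen only to make the idea concrete): each node holds a homogeneous transform relative to its parent, objects can be inserted or deleted between frames, and a part's pose in camera coordinates results from chaining the transforms along the tree.

```python
import numpy as np

class SceneNode:
    """One node of a dynamic scene tree (illustrative sketch)."""
    def __init__(self, name, pose):
        self.name = name          # object or sub-object label
        self.pose = pose          # 4x4 homogeneous transform relative to parent
        self.children = {}

    def insert(self, child):
        """Insert an object appearing in the current frame."""
        self.children[child.name] = child

    def delete(self, name):
        """Delete an object that has left the scene."""
        self.children.pop(name, None)

def translation(x, y, z):
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T

# A tiny tree: camera -> car -> wheel (a functionally separate part).
root = SceneNode("camera", np.eye(4))
car = SceneNode("car", translation(0.0, 0.0, 50.0))     # assumed 50 m ahead
wheel = SceneNode("wheel", translation(1.0, -0.5, 1.5)) # relative to car
root.insert(car)
car.insert(wheel)

# Pose of the wheel in camera coordinates by chaining the transforms:
wheel_in_cam = root.pose @ car.pose @ wheel.pose
# translation of the wheel in the camera frame: (1.0, -0.5, 51.5)
```

A perspective projection applied to such chained poses then yields the expected feature locations in the image, which is what the tracking loop predicts frame by frame.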
2.4.1 Gain by Multiple Images in Space and/or Time for Model Fitting
High-frequency temporal embedding alleviates the correspondence problem between features from one frame to the next, since they will have moved only by a small amount. This reduces the search range in a top-down feature extraction mode like the one used for tracking. Especially if there are strong, unpredictable perturbations, their effect on feature position is minimized by frequent measurements. Doubling the sampling rate, for example, allows detecting a perturbation onset much earlier (on average). Since tracking in the image has to be done in two dimensions, the search area shrinks quadratically with the sampling rate, whereas the time available for evaluation shrinks only linearly. As mentioned pre-
viously for reference, humans cannot tell the correct sequence of two events if they
are less than 30 ms apart, even though they can perceive that there are two separate
events [Pöppel, Schill 1995]. Experimental experience with technical vision systems
has shown that using every frame of a 25 Hz image sequence (40 ms cycle time)
allows high-quality object tracking if feature extraction algorithms with subpixel accuracy and well-tuned recursive estimation processes are applied. This tuning has to be adapted by knowledge components that take the driving situation of the vehicle and the lighting conditions into account.
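The quadratic gain in search area can be illustrated with a short calculation (the feature speed and the window margin are assumed values for illustration, not taken from the text):

```python
def search_area(max_speed_px_per_s, rate_hz, margin_px=2.0):
    """Area of a square search window whose side is twice the maximum
    inter-frame displacement plus a fixed margin for perturbations."""
    d = max_speed_px_per_s / rate_hz      # max displacement between frames
    side = 2.0 * d + margin_px
    return side * side

v = 200.0                          # assumed maximum feature speed in pixels/s
a25 = search_area(v, 25.0)         # 40 ms cycle time, as in the text
a50 = search_area(v, 50.0)         # doubled sampling rate

# The time available per frame only halves, but the search area shrinks by
# nearly a factor of four (exactly four for zero margin):
ratio = a25 / a50
```

The fixed margin models the residual uncertainty from unpredictable perturbations; the higher the sampling rate, the more this margin dominates the window size.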
This does not imply, however, that all processing on the higher levels has to run at this high rate. Maneuver recognition of other subjects, situation assessment, and behavior decision for locomotion can, in general, be performed at a (much) lower rate without sacrificing quality of performance. This may partly be due to the biological nature of humans: it is almost impossible for humans to react with a response time of less than several hundred milliseconds. As mentioned before, the unit "second" may have been chosen as the basic timescale for this reason.
However, high image rates provide the opportunity both for early detection of events and for data smoothing on timescales matched to the motion processes of interest. Human extremities like arms or legs can hardly be activated at more than a 2 Hz corner frequency. Therefore, efficient vision systems should concentrate computing resources where information can be gained best (at expected feature locations of known objects/subjects of interest) and on regions where new objects may occur. Foveal-peripheral differentiation of spatial resolution in connection with fast gaze control may be considered an optimal vision system design found in