Page 37 - Dynamic Vision for Perception and Control of Motion
2 Basic Relations: Image Sequences – “the World”
Vision is a process in which temporally changing intensity and color values in the
image plane have to be interpreted as processes in the real world that happen in
3-D space over time. Each image from today’s TV cameras contains about half a
million pixels. Twenty-five (or thirty) of these images are taken per second. This
high frame rate has been chosen to induce the impression of steady and continuous
motion in human observers. If each image were completely different from the
others, as in a slide show with snapshots from scenes taken far apart in time and
space, and the images were displayed at normal video rate as a film, nobody would
understand what is being shown. The continuous development of action that makes
films understandable would be missing.
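
The figures quoted above imply a substantial raw data stream. As a rough illustration only, the sketch below multiplies the pixel count by the frame rate and derives the time between successive frames. The round value of 500,000 pixels per frame, the simplification of one value per pixel, and the function names are assumptions made for this example, not taken from the text.

```python
# Rough arithmetic for the raw data stream of a video camera.
# Assumption: "about half a million pixels" is taken as exactly 500_000,
# and each pixel is counted as a single value (color channels ignored).

PIXELS_PER_FRAME = 500_000  # hypothetical round figure


def pixel_rate(pixels_per_frame: int, frames_per_second: float) -> float:
    """Return the number of pixel values delivered per second."""
    return pixels_per_frame * frames_per_second


def frame_interval_ms(frames_per_second: float) -> float:
    """Return the time between successive frames in milliseconds."""
    return 1000.0 / frames_per_second


if __name__ == "__main__":
    for fps in (25, 30):  # the two video rates mentioned in the text
        rate = pixel_rate(PIXELS_PER_FRAME, fps)
        interval = frame_interval_ms(fps)
        print(f"{fps} frames/s: {rate / 1e6:.1f} million pixel values/s, "
              f"{interval:.1f} ms between frames")
```

With these round numbers, the stream amounts to 12.5 million pixel values per second at 25 frames per second and 15 million at 30, with about 40 ms and 33 ms between successive frames.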
This should make clear that it is not the content of each single image that
constitutes the information conveyed to the observer, but the relatively slow
development of motion and of action over time. The common unit of 1 second defines
the temporal resolution most adequate for human understanding. Thus, relatively
slow-moving objects and slow-acting subjects are the essential carriers of
information in this framework. A bullet flying through the scene can be perceived
only by the effect it has on other objects or subjects. Therefore, the capability
of visual perception is based on the ability to generate internal representations
of temporal processes in 3-D space and time with objects and subjects (synthesis),
which are supported by feature flows from image sequences (analysis). This is an
animation process with generically known elements; both the parameters defining the
actual 3-D shape and the time history of the state variables of the objects
observed have to be determined from vision.
In this “analysis by synthesis” procedure chosen in the 4-D approach to dynamic
vision, the internal representations in the interpretation process have four
independent variables: three orthogonal space components (3-D space) and time. For
common tasks in our natural environment (mesoscale, that is, neither too small nor
too large), these variables are known to be sufficiently representative in the
classical nonrelativistic sense.
As mentioned in the introduction, fast image sequences contain considerable
redundancy, since in general only small changes occur from one frame to the next;
massive bodies show continuity in their motion. The characteristic frequencies of
human and most animal motion are less than a few oscillations per second (Hz), so
that at video rate, at least a dozen image frames are taken per oscillation period;
for example, a 2 Hz motion sampled at 25 frames per second yields 12.5 frames per
period. According to sampled-data theory, this allows good recognition of the
dynamic parameters in frequency space (time constants, eigenfrequencies, and
damping). So, the task of visual dynamic scene understanding can be described as
follows: