
2  Basic Relations: Image Sequences – “the World”

Vision is a process in which temporally changing intensity and color values in the image plane have to be interpreted as processes in the real world that happen in 3-D space over time. Each image of today’s TV cameras contains about half a million pixels, and twenty-five (or thirty) of these images are taken per second. This high image frame rate has been chosen to induce the impression of steady and continuous motion in human observers. If each image were completely different from the others, as in a slide show of snapshots from scenes taken far apart in time and space, and were displayed at normal video rate as a film, nobody would understand what is being shown. The continuous development of action that makes films understandable would be missing.
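To put these figures in perspective, the raw data rate implied by the numbers just given (about half a million pixels per frame at 25 frames per second) is roughly

\[
0.5 \times 10^{6}\ \tfrac{\text{pixels}}{\text{frame}} \times 25\ \tfrac{\text{frames}}{\text{s}} \;\approx\; 12.5 \times 10^{6}\ \tfrac{\text{pixel values}}{\text{s}} .
\]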
This should make clear that it is not the content of each single image that constitutes the information conveyed to the observer, but rather the relatively slow development of motion and of action over time. The common unit of 1 second defines the temporal resolution most adequate for human understanding. Thus, relatively slowly moving objects and slowly acting subjects are the essential carriers of information in this framework. A bullet flying through the scene can be perceived only by the effect it has on other objects or subjects. Therefore, the capability of visual perception is based on the ability to generate internal representations of temporal processes in 3-D space and time with objects and subjects (synthesis), supported by feature flows from image sequences (analysis). This is an animation process with generically known elements; both the parameters defining the actual 3-D shape and the time history of the state variables of the objects observed have to be determined from vision.
In this “analysis by synthesis” procedure chosen in the 4-D approach to dynamic vision, the internal representations in the interpretation process have four independent variables: three orthogonal space components (3-D space) and time. For common tasks in our natural (mesoscale, that is, neither too small nor too large) environment, these variables are known to be sufficiently representative in the classical nonrelativistic sense.
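The following is a minimal sketch, in Python, of the prediction-error feedback idea behind analysis by synthesis: an internal 3-D state is propagated over one video cycle, projected into the image plane (synthesis), compared with a measured feature (analysis), and corrected. It is only an illustration under simple assumptions; the constant-velocity model, the pinhole projection with an assumed camera constant, and the fixed correction gain stand in for the recursive state estimation actually used, and all names and numbers are hypothetical.

import numpy as np

DT = 1.0 / 25.0          # video cycle time: 25 frames per second
FOCAL_LENGTH = 750.0     # camera constant in pixels (assumed value)

def predict(state):
    """Propagate the internal 3-D state by one video cycle (constant velocity)."""
    pos, vel = state
    return pos + vel * DT, vel

def project(state):
    """Synthesis step: expected image feature from a pinhole projection."""
    (x, y, z), _ = state
    return np.array([FOCAL_LENGTH * x / z, FOCAL_LENGTH * y / z])

def update(state, measured_feature, gain=0.3):
    """Analysis step: correct the 3-D position from the image-plane residual."""
    pos, vel = state
    residual = measured_feature - project(state)          # pixels
    # Map the pixel residual back to lateral 3-D position (linearized, crude).
    pos = pos + gain * pos[2] / FOCAL_LENGTH * np.array([residual[0], residual[1], 0.0])
    return pos, vel

# Synthetic image sequence: one point feature of an object approaching the camera.
true_pos = np.array([2.0, 1.0, 20.0])                     # meters
true_vel = np.array([0.0, 0.0, -5.0])                     # meters per second
state = (np.array([2.5, 0.5, 20.0]), true_vel.copy())     # rough initial guess

for frame in range(50):                                   # two seconds of video
    true_pos = true_pos + true_vel * DT
    measured = np.array([FOCAL_LENGTH * true_pos[0] / true_pos[2],
                         FOCAL_LENGTH * true_pos[1] / true_pos[2]])
    state = predict(state)
    state = update(state, measured)

print("estimated 3-D position after 2 s:", state[0])
print("true 3-D position after 2 s:     ", true_pos)

The point of the sketch is only the cycle structure (predict, project, compare, correct), repeated at video rate on an internal representation in 3-D space and time.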
As mentioned in the introduction, fast image sequences contain quite a bit of redundancy, since, in general, only small changes occur from one frame to the next; massive bodies show continuity in their motion. The characteristic frequencies of human and most animal motion are less than a few oscillations per second (Hz), so that at video rate at least a dozen image frames are taken per oscillation period. According to sampled-data theory, this allows good recognition of the dynamic parameters in frequency space (time constants, eigenfrequencies, and damping).
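As a numerical illustration of this sampling argument, the sketch below (a rough illustration, not a procedure from the text) generates a damped oscillation at an assumed eigenfrequency of 2 Hz, samples it at the 25 Hz video rate, and recovers eigenfrequency and damping from the samples via peak spacing and the logarithmic decrement; all parameter values are illustrative.

import numpy as np

FRAME_RATE = 25.0        # video frames per second
F_NATURAL = 2.0          # assumed eigenfrequency of the observed motion, Hz
ZETA = 0.1               # assumed damping ratio

# Sample a damped oscillation at video rate (about 12.5 frames per period).
t = np.arange(0.0, 4.0, 1.0 / FRAME_RATE)
omega_n = 2.0 * np.pi * F_NATURAL
omega_d = omega_n * np.sqrt(1.0 - ZETA**2)
x = np.exp(-ZETA * omega_n * t) * np.cos(omega_d * t)

# Locate the sampled peaks (local maxima) of the decaying oscillation.
peaks = [k for k in range(1, len(x) - 1) if x[k] > x[k - 1] and x[k] > x[k + 1]]

# Eigenfrequency estimate from the mean spacing of successive peaks.
f_est = 1.0 / np.diff(t[peaks]).mean()

# Damping ratio estimate from the logarithmic decrement of peak amplitudes.
delta = np.mean(np.log(x[peaks][:-1] / x[peaks][1:]))
zeta_est = delta / np.sqrt(4.0 * np.pi**2 + delta**2)

print(f"samples per period:       {FRAME_RATE / f_est:.1f}")
print(f"estimated eigenfrequency: {f_est:.2f} Hz (damped value {F_NATURAL * np.sqrt(1 - ZETA**2):.2f} Hz)")
print(f"estimated damping ratio:  {zeta_est:.2f} (true {ZETA})")

With roughly a dozen samples per oscillation period, both estimates should come out close to the values used to generate the signal, which is the point of the sampling argument above.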
So, the task of visual dynamic scene understanding can be described as follows: