Page 102 - Biomimetics : Biologically Inspired Technologies

Bar-Cohen : Biomimetics: Biologically Inspired Technologies DK3163_c003 Final Proof page 88 21.9.2005 11:40pm

Figure 3.9 Vision cognition architecture. The raw input to the visual system is a wide-angle high-resolution video
camera (large frame shown in the lower right of the figure). A subimage, of a permanently fixed size (say
1024 × 1024 pixels) of a single video frame (shown as a square within the large frame), termed the eyeball
image, is determined by the location of its center (depicted by the intersection of crosshairs), known as the fixation
point. The gaze controller uses the entire large frame to select a single fixation point, if it deems that such a
selection is warranted for this large frame (it only attempts to select a fixation point when processing of the last
eyeball image has been completed). For simplicity, it is assumed that the video camera is fixed and is able to see
the entire visual scene of interest (e.g., a camera viewing a busy downtown intersection). The confabulation
architecture used for visual processing is described in the text.
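
The relationship between a fixation point and its eyeball image can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `eyeball_window` and `crop` are hypothetical, and the rule of shifting the window to keep it inside the frame is an assumption, since the caption does not say how fixation points near the frame edge are handled.

```python
def eyeball_window(frame_w, frame_h, fix_x, fix_y, size=1024):
    """Return (left, top, right, bottom) of a size x size window centered
    on the fixation point (fix_x, fix_y).

    Assumption: if the fixation point is too close to a frame edge, the
    window is shifted just enough to stay inside the frame.
    """
    if size > frame_w or size > frame_h:
        raise ValueError("eyeball image must fit inside the frame")
    half = size // 2
    left = min(max(fix_x - half, 0), frame_w - size)
    top = min(max(fix_y - half, 0), frame_h - size)
    return left, top, left + size, top + size


def crop(frame, window):
    """Cut the window out of a frame stored as a list of pixel rows."""
    left, top, right, bottom = window
    return [row[left:right] for row in frame[top:bottom]]


if __name__ == "__main__":
    # A fixation point at the center of an 8,192 x 8,192 frame.
    print(eyeball_window(8192, 8192, 4096, 4096))
```

Because the window size is fixed, the fixation point alone determines the eyeball image, which is why the gaze controller only needs to output a single point per frame.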

observer is viewing the video, it is important that they carry out whatever specific task or
tasks the automated vision system will be asked to carry out (e.g., spotting people, pets,
bicycles, and cars).
After the human observer has viewed many tens of hours of video while carrying out the
function that the machine visual cognition system will later perform, and their eye movements
have been recorded, the result is a record of their fixation point choices, each paired with the
still frame (that is, the specific scene content) in view when the choice was made. This record is
then used to train a multi-layer perceptron (Hecht-Nielsen, 2004) to carry out the gaze control
function. The basic idea is simple.
Each frame of high-resolution video is described by an image feature vector V. This feature vector
is produced by first taking the inner product of each of a collection of Gabor logons with the image
frame (both considered as vectors of the same dimension). The specific Gabor logons used in
forming V are now described (each logon is defined by the constants E, F, and G, and by its position
and angle of plane rotation in the image; see Figure 3.10).
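
As a rough illustration of the inner-product step, the sketch below samples a textbook real-valued Gabor function (a Gaussian envelope multiplied by a cosine carrier) on the same pixel grid as the frame and takes its dot product with the flattened frame. The parameters `sigma` and `wavelength` are illustrative stand-ins and are not the constants E, F, and G of Figure 3.10; the function names are hypothetical as well.

```python
import math


def gabor_logon(width, height, cx, cy, theta, sigma, wavelength):
    """Sample a textbook real-valued Gabor function on a width x height grid.

    Assumption: sigma and wavelength are generic placeholders, not the
    E, F, G constants of the source. The result is a flat, row-major list,
    so the logon has the same dimension as a flattened image frame.
    """
    values = []
    for y in range(height):
        for x in range(width):
            dx, dy = x - cx, y - cy
            # Rotate coordinates by theta about the logon center (cx, cy).
            u = dx * math.cos(theta) + dy * math.sin(theta)
            v = -dx * math.sin(theta) + dy * math.cos(theta)
            envelope = math.exp(-(u * u + v * v) / (2.0 * sigma * sigma))
            values.append(envelope * math.cos(2.0 * math.pi * u / wavelength))
    return values


def inner_product(a, b):
    """Dot product of two vectors of the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(p * q for p, q in zip(a, b))


def feature_vector(frame_flat, logons):
    """One component per logon: its inner product with the frame."""
    return [inner_product(frame_flat, g) for g in logons]


if __name__ == "__main__":
    # Tiny 8 x 8 "frame" and two logons at different rotation angles.
    frame = [float((x + y) % 3) for y in range(8) for x in range(8)]
    logons = [gabor_logon(8, 8, 4, 4, t, 2.0, 4.0) for t in (0.0, math.pi / 2)]
    print(feature_vector(frame, logons))
```

Sampling each logon over the full frame keeps the sketch close to the description of both objects as vectors of the same dimension; a practical implementation would more likely evaluate each logon only over a small patch around its center, where the envelope is non-negligible.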
First, we create a fixed rectangular set of gridpoints located at equal pixel spacings across the
entire high-resolution video camera frame (Caid and Hecht-Nielsen, 2001, 2004; Daugman, 1985,
1987, 1988a,b; Daugman and Kammen, 1987; Hecht-Nielsen, 1990; Hecht-Nielsen and Zhou,
1995). For example, if each video camera image frame were an 8,192 × 8,192 pixel digital image,
with a 16-bit panchromatic grayscale, or equivalently, a 67,108,864-dimensional floating point real
vector with integer components between 0 and 65,535, then we might have gridpoints spaced every
16 pixels vertically and horizontally, with gridpoints on the image edges, for a total of 513 × 513 =
263,169 gridpoints.
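
The arithmetic in this example can be checked with a short sketch. The function name `grid_offsets` is hypothetical, and the sketch assumes the frame size is an exact multiple of the gridpoint spacing, as it is in the example.

```python
def grid_offsets(frame_size, spacing):
    """Gridpoint offsets along one axis, with a gridpoint on each edge.

    Assumption: frame_size is an exact multiple of spacing. The last
    offset equals frame_size (the far edge); how the source maps that
    edge onto zero-based pixel indices is not stated.
    """
    if frame_size % spacing != 0:
        raise ValueError("frame_size must be a multiple of spacing")
    return list(range(0, frame_size + 1, spacing))


if __name__ == "__main__":
    frame_size, spacing, bits = 8192, 16, 16

    per_axis = len(grid_offsets(frame_size, spacing))
    print("gridpoints per axis:", per_axis)             # 8192 / 16 + 1
    print("total gridpoints:", per_axis * per_axis)
    print("frame vector dimension:", frame_size * frame_size)
    print("largest 16-bit pixel value:", 2 ** bits - 1)
```

Counting both edges is what turns 8,192 / 16 = 512 intervals into 513 gridpoints per axis.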
At each gridpoint we create a set of Gabor logons centered at that position, each having a
specified rotation angle and E, F, and G values. The set of logons at each gridpoint is exactly the
same, save for their translated position. This set, which is now described, is termed a jet (von der