
                    the squares of the sine and cosine inner products of the logons of the same scale and rotational
                    orientation in each jet (which reduces the total dimensionality of V to half that of the total
                    number of logons). (Note: Other mathematical transformations are then applied to each
                    of these sums to make their values insensitive to lighting gradient slopes and other lighting-
                    dependent effects — but these details go beyond the scope of this sketch and so are left out —
                    see Hecht-Nielsen and Zhou, 1995 for examples of such transformations.)
                       Each component of V essentially represents an estimate of the localized spatial frequency
                    content of the camera image (at the position of the associated gridpoint) at the spatial frequency
                    of the involved logon pair, in the direction of oscillation of that pair. It is on the basis of local spatial
                    frequency structure (which V accurately defines) that fixation points are chosen by the gaze
                    controller.
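                       Since the logon parameters themselves are not specified in this sketch, the following
                    Python illustration assumes Gaussian-windowed sinusoid logons with arbitrary example
                    parameters; the function names gabor_logon and jet_component and the default sigma and
                    wavelength values are illustrative, not taken from the chapter. It shows how a single
                    component of V would be formed as the sum of the squares of the sine and cosine inner
                    products of one logon pair with the image patch at a gridpoint:

                    import numpy as np

                    def gabor_logon(size, wavelength, theta, phase, sigma):
                        # One Gabor logon: a sinusoid of the given spatial frequency and
                        # oscillation direction theta under a Gaussian window. phase = 0
                        # gives the cosine member of a pair; phase = pi/2 the sine member.
                        half = size // 2
                        y, x = np.mgrid[-half:half + 1, -half:half + 1]
                        xr = x * np.cos(theta) + y * np.sin(theta)  # rotated coordinate
                        envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
                        return envelope * np.cos(2.0 * np.pi * xr / wavelength + phase)

                    def jet_component(patch, wavelength, theta, sigma=4.0):
                        # One component of V for a gridpoint: the sum of the squares of
                        # the sine and cosine inner products of the logon pair with the
                        # (odd-sized, square) image patch centered on that gridpoint.
                        size = patch.shape[0]
                        cos_ip = np.sum(patch * gabor_logon(size, wavelength, theta, 0.0, sigma))
                        sin_ip = np.sum(patch * gabor_logon(size, wavelength, theta, np.pi / 2, sigma))
                        return cos_ip**2 + sin_ip**2

                    Because the squared sine and cosine responses are summed, each component is insensitive
                    to the phase of the local image structure; this phase pooling is what reduces the
                    dimensionality of V to half the total number of logons.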
                       The job of the gaze controller is to learn to mimic the performance of a skilled human observer
                    performing the visual task that is to be mechanized. The manner in which the gaze controller works
                    and the method used to train it are now described.
                       The gaze controller (a perceptron; Hecht-Nielsen, 2004) has 224 inputs and two outputs. The
                    inputs represent the components of V corresponding to the jet at a particular image gridpoint (the
                    current position of regard of the gaze controller). The outputs of the gaze controller are estimates of
                    the a posteriori probability of this gridpoint being chosen by the skilled human as a fixation point
                    along with the a posteriori probability of this gridpoint not being chosen by the skilled human as a
                    fixation point. Training of the gaze controller is discussed below; but, to set the stage, the manner in
                    which the gaze controller is used operationally is described first.
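                       The internal structure of this perceptron is not detailed here, so the following Python
                    sketch assumes a single tanh hidden layer, sigmoid outputs, and a squared-error
                    backpropagation rule; the hidden-layer width, learning rate, and initialization scale are
                    illustrative guesses rather than values from Hecht-Nielsen (2004):

                    import numpy as np

                    class GazePerceptron:
                        # Minimal sketch of the gaze controller: 224 jet-component inputs,
                        # one tanh hidden layer, and two sigmoid outputs estimating
                        # P(fixation) and P(not fixation) for the gridpoint under regard.
                        def __init__(self, n_in=224, n_hidden=32, n_out=2, lr=0.05, seed=0):
                            rng = np.random.default_rng(seed)
                            self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
                            self.b1 = np.zeros(n_hidden)
                            self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
                            self.b2 = np.zeros(n_out)
                            self.lr = lr

                        def forward(self, v):
                            self.v = v
                            self.h = np.tanh(v @ self.W1 + self.b1)
                            self.y = 1.0 / (1.0 + np.exp(-(self.h @ self.W2 + self.b2)))
                            return self.y

                        def train_step(self, v, target):
                            # One backpropagation training episode on a single example.
                            y = self.forward(v)
                            delta2 = (y - target) * y * (1.0 - y)              # output-layer error
                            delta1 = (delta2 @ self.W2.T) * (1.0 - self.h**2)  # hidden-layer error
                            self.W2 -= self.lr * np.outer(self.h, delta2)
                            self.b2 -= self.lr * delta2
                            self.W1 -= self.lr * np.outer(self.v, delta1)
                            self.b1 -= self.lr * delta1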
                       Once trained, the gaze controller is used to select a fixation point in a newly acquired video
                    frame by evaluating each of the V component sets from each of the 263,169 gridpoints of the frame.
                    If the first output of the controller is above a fixed threshold (say, 0.8), and the second output is
                    below a fixed threshold (say, 0.2), then that gridpoint is selected as a candidate fixation point. If
                    there are no candidate fixation points for the frame, then that frame is skipped. If there are one or
                    more, the one with the highest first output value is selected as the fixation point. The gaze controller
                    also has provisions for creating multiple successive ‘‘looks’’ at the same object during visual
                    training to facilitate learning of pose insensitivity (see below). In operational use, when a visual
                    object of interest has been fixated on and described, the gaze controller tracks that object’s fixation
                    points and prevents return to it until the other visual objects of interest in the scene have been
                    described.
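                       The candidate-selection rule can be sketched as follows, assuming the V components of a
                    frame are held as one 224-component row per gridpoint (263,169 rows, consistent with a
                    513 × 513 gridpoint lattice) and that the trained controller is available as a callable;
                    the function name and the array layout are assumptions:

                    import numpy as np

                    def select_fixation_point(V, controller, t_hi=0.8, t_lo=0.2):
                        # V: (n_gridpoints, 224) array, one jet's worth of components per
                        # gridpoint. controller maps a 224-vector to its two outputs.
                        # A gridpoint is a candidate when the first output exceeds t_hi
                        # and the second falls below t_lo; among candidates, the one with
                        # the highest first output wins. None means the frame is skipped.
                        best_idx, best_p = None, -1.0
                        for i, v in enumerate(V):
                            p_fix, p_not = controller(v)
                            if p_fix > t_hi and p_not < t_lo and p_fix > best_p:
                                best_idx, best_p = i, p_fix
                        return best_idx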
                       To train the gaze controller, each fixation point example (for which a reference frame,
                    captured a fixed time increment before the onset of the human's saccade, is selected as the
                    definitive ‘‘image input’’ that the human used) has its pixel coordinates (supplied by the
                    frequently recalibrated eye tracker) stored with its reference frame. Eventually, many thousands of such
                    fixation point and reference frame pairs are produced, randomly scrambled to remove possible
                    content correlations between them, and stored. The V vector for each reference frame is also
                    calculated and stored with it.
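                       This bookkeeping step might be sketched as follows, assuming each raw example arrives as
                    a (reference frame, fixation pixel coordinates) pair and that the caller supplies a
                    compute_V function such as the jet computation sketched earlier; the record layout is an
                    assumption:

                    import random

                    def build_training_set(examples, compute_V):
                        # examples: iterable of (reference_frame, (fix_x, fix_y)) pairs,
                        # where the coordinates come from the recalibrated eye tracker.
                        # compute_V(frame) -> all gridpoint jet components for the frame.
                        dataset = []
                        for frame, fix_xy in examples:
                            dataset.append({"frame": frame,
                                            "fixation": fix_xy,
                                            "V": compute_V(frame)})
                        # Scramble the stored pairs to remove possible content
                        # correlations between successive examples.
                        random.shuffle(dataset)
                        return dataset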
                    The gaze controller perceptron is trained by marching through the fixation point and reference
                    frame examples, in sequence, many times. At each training episode, the next fixation point and
                    reference frame example in sequence is selected and the gridpoint nearest to the fixation point is
                    located. The jet components of the reference frame V vector for that gridpoint are then extracted
                    and provided to the perceptron, along with desired outputs 1 and 0, and one backpropagation
                    training episode using these specified inputs and outputs is carried out. Another gridpoint, distant
                    from the fixation point, is then selected and its jet V components are provided to the perceptron,
                    along with desired outputs 0 and 1, and a second perceptron training episode is carried out using
                    these inputs and outputs. The training process then moves on to the next fixation point and
                    reference frame example. Thus, this training procedure beneficially utilizes oversampling of the examples of
                    the class of human-supplied fixation points (Hecht-Nielsen, 2004).
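                       The loop below sketches this two-episode-per-example procedure, assuming the shuffled
                    records produced by the build_training_set sketch, a regular gridpoint lattice with a known
                    pixel spacing, and a train_step callable that performs one backpropagation episode (for
                    instance, the GazePerceptron.train_step method sketched earlier); the epoch count and the
                    minimum distance used to pick the ‘‘distant’’ negative gridpoint are illustrative guesses:

                    import numpy as np

                    def train_gaze_controller(dataset, train_step, grid_dims, spacing,
                                              n_epochs=10, min_neg_dist=100.0, seed=0):
                        # grid_dims is (rows, cols) of the gridpoint lattice; spacing is
                        # the lattice pitch in pixels, so gridpoint (r, c) sits at pixel
                        # (c * spacing, r * spacing). Each record's "V" holds one
                        # 224-vector per gridpoint in row-major order.
                        rng = np.random.default_rng(seed)
                        rows, cols = grid_dims
                        for _ in range(n_epochs):
                            for ex in dataset:
                                fx, fy = ex["fixation"]
                                # Episode 1: the gridpoint nearest the human fixation
                                # point, trained toward outputs (1, 0).
                                r = min(rows - 1, max(0, int(round(fy / spacing))))
                                c = min(cols - 1, max(0, int(round(fx / spacing))))
                                train_step(ex["V"][r * cols + c], np.array([1.0, 0.0]))
                                # Episode 2: a gridpoint distant from the fixation point,
                                # trained toward outputs (0, 1).
                                while True:
                                    nr, nc = int(rng.integers(rows)), int(rng.integers(cols))
                                    if np.hypot(nc * spacing - fx, nr * spacing - fy) > min_neg_dist:
                                        break
                                train_step(ex["V"][nr * cols + nc], np.array([0.0, 1.0]))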