
                    the squares of the sine and cosine inner products of the logons of the same scale and rotational
                    orientation in each jet (which reduces the total dimensionality of V to half that of the total
                    number of logons). (Note: Other mathematical transformations are then applied to each
                    of these sums to make their values insensitive to lighting gradient slopes and other lighting-
                    dependent effects — but these details go beyond the scope of this sketch and so are left out —
                    see Hecht-Nielsen and Zhou, 1995 for examples of such transformations.)
                       Each component of V essentially represents an estimate of the localized spatial frequency
                    content of the camera image (at the position of the associated gridpoint) at the spatial frequency
                    of the involved logon pair, in the direction of oscillation of that pair. It is on the basis of local spatial
                    frequency structure (which V accurately defines) that fixation points are chosen by the gaze
                    controller.
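                       Since the logon parameters themselves are not specified in this sketch, the following
                    Python illustration assumes Gaussian-windowed sinusoid logons with arbitrary example
                    parameters; the function names gabor_logon and jet_component and the default sigma and
                    wavelength values are illustrative, not taken from the chapter. It shows how a single
                    component of V would be formed as the sum of the squares of the sine and cosine inner
                    products of one logon pair with the image patch at a gridpoint:

                    import numpy as np

                    def gabor_logon(size, wavelength, theta, phase, sigma):
                        # One Gabor logon: a sinusoid of the given spatial frequency and
                        # oscillation direction theta under a Gaussian window. phase = 0
                        # gives the cosine member of a pair; phase = pi/2 the sine member.
                        half = size // 2
                        y, x = np.mgrid[-half:half + 1, -half:half + 1]
                        xr = x * np.cos(theta) + y * np.sin(theta)  # rotated coordinate
                        envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
                        return envelope * np.cos(2.0 * np.pi * xr / wavelength + phase)

                    def jet_component(patch, wavelength, theta, sigma=4.0):
                        # One component of V for a gridpoint: the sum of the squares of
                        # the sine and cosine inner products of the logon pair with the
                        # (odd-sized, square) image patch centered on that gridpoint.
                        size = patch.shape[0]
                        cos_ip = np.sum(patch * gabor_logon(size, wavelength, theta, 0.0, sigma))
                        sin_ip = np.sum(patch * gabor_logon(size, wavelength, theta, np.pi / 2, sigma))
                        return cos_ip**2 + sin_ip**2

                    Because the squared sine and cosine responses are summed, each component is insensitive
                    to the phase of the local image structure; this phase pooling is what reduces the
                    dimensionality of V to half the total number of logons.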
                       The job of the gaze controller is to learn to mimic the performance of a skilled human observer
                    performing the visual task that is to be mechanized. The manner in which the gaze controller works
                    and the method used to train it are now described.
                       The gaze controller (a perceptron; Hecht-Nielsen, 2004) has 224 inputs and two outputs. The
                    inputs represent the components of V corresponding to the jet at a particular image gridpoint (the
                    current position of regard of the gaze controller). The outputs of the gaze controller are estimates of
                    the a posteriori probability of this gridpoint being chosen by the skilled human as a fixation point
                    along with the a posteriori probability of this gridpoint not being chosen by the skilled human as a
                    fixation point. Training of the gaze controller is discussed below; but, to set the stage, the manner in
                    which the gaze controller is used operationally is described first.
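                       The internal structure of this perceptron is not detailed here, so the following Python
                    sketch assumes a single tanh hidden layer, sigmoid outputs, and a squared-error
                    backpropagation rule; the hidden-layer width, learning rate, and initialization scale are
                    illustrative guesses rather than values from Hecht-Nielsen (2004):

                    import numpy as np

                    class GazePerceptron:
                        # Minimal sketch of the gaze controller: 224 jet-component inputs,
                        # one tanh hidden layer, and two sigmoid outputs estimating
                        # P(fixation) and P(not fixation) for the gridpoint under regard.
                        def __init__(self, n_in=224, n_hidden=32, n_out=2, lr=0.05, seed=0):
                            rng = np.random.default_rng(seed)
                            self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
                            self.b1 = np.zeros(n_hidden)
                            self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_out))
                            self.b2 = np.zeros(n_out)
                            self.lr = lr

                        def forward(self, v):
                            self.v = v
                            self.h = np.tanh(v @ self.W1 + self.b1)
                            self.y = 1.0 / (1.0 + np.exp(-(self.h @ self.W2 + self.b2)))
                            return self.y

                        def train_step(self, v, target):
                            # One backpropagation training episode on a single example.
                            y = self.forward(v)
                            delta2 = (y - target) * y * (1.0 - y)              # output-layer error
                            delta1 = (delta2 @ self.W2.T) * (1.0 - self.h**2)  # hidden-layer error
                            self.W2 -= self.lr * np.outer(self.h, delta2)
                            self.b2 -= self.lr * delta2
                            self.W1 -= self.lr * np.outer(self.v, delta1)
                            self.b1 -= self.lr * delta1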
                       Once trained, the gaze controller is used to select a fixation point in a newly acquired video
                    frame by evaluating each of the V component sets from each of the 263,169 gridpoints of the frame.
                    If the first output of the controller is above a fixed threshold (say, 0.8), and the second output is
                    below a fixed threshold (say, 0.2), then that gridpoint is selected as a candidate fixation point. If
                    there are no candidate fixation points for the frame, then that frame is skipped. If there are one or
                    more, the one with the highest first output value is selected as the fixation point. The gaze controller
                    also has provisions for creating multiple successive ‘‘looks’’ at the same object during visual
                    training to facilitate learning of pose insensitivity (see below). In operational use, when a visual
                    object of interest has been fixated on and described, the gaze controller tracks that object’s fixation
                    points and prevents return to it until the other visual objects of interest in the scene have been
                    described.
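                       The candidate-selection rule can be sketched as follows, assuming the V components of a
                    frame are held as one 224-component row per gridpoint (263,169 rows, consistent with a
                    513 × 513 gridpoint lattice) and that the trained controller is available as a callable;
                    the function name and the array layout are assumptions:

                    import numpy as np

                    def select_fixation_point(V, controller, t_hi=0.8, t_lo=0.2):
                        # V: (n_gridpoints, 224) array, one jet's worth of components per
                        # gridpoint. controller maps a 224-vector to its two outputs.
                        # A gridpoint is a candidate when the first output exceeds t_hi
                        # and the second falls below t_lo; among candidates, the one with
                        # the highest first output wins. None means the frame is skipped.
                        best_idx, best_p = None, -1.0
                        for i, v in enumerate(V):
                            p_fix, p_not = controller(v)
                            if p_fix > t_hi and p_not < t_lo and p_fix > best_p:
                                best_idx, best_p = i, p_fix
                        return best_idx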
                       To train the gaze controller, each fixation point example (for which a reference frame,
                    captured a fixed time increment before the onset of the human's saccade, is selected as the
                    definitive ‘‘image input’’ that the human used) has its pixel coordinates (supplied by the
                    frequently recalibrated eye tracker) stored with its reference frame. Eventually, many thousands of such
                    fixation point and reference frame pairs are produced, randomly scrambled to remove possible
                    content correlations between them, and stored. The V vector for each reference frame is also
                    calculated and stored with it.
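                       This bookkeeping step might be sketched as follows, assuming each raw example arrives as
                    a (reference frame, fixation pixel coordinates) pair and that the caller supplies a
                    compute_V function such as the jet computation sketched earlier; the record layout is an
                    assumption:

                    import random

                    def build_training_set(examples, compute_V):
                        # examples: iterable of (reference_frame, (fix_x, fix_y)) pairs,
                        # where the coordinates come from the recalibrated eye tracker.
                        # compute_V(frame) -> all gridpoint jet components for the frame.
                        dataset = []
                        for frame, fix_xy in examples:
                            dataset.append({"frame": frame,
                                            "fixation": fix_xy,
                                            "V": compute_V(frame)})
                        # Scramble the stored pairs to remove possible content
                        # correlations between successive examples.
                        random.shuffle(dataset)
                        return dataset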
                    The gaze controller perceptron is trained by marching through the fixation point and reference
                    frame examples, in sequence, many times. At each training episode, the next fixation point and
                    reference frame example in sequence is selected and the gridpoint nearest to the fixation point is
                    located. The jet components of the reference frame V vector for that gridpoint are then extracted
                    and provided to the perceptron, along with desired outputs 1 and 0, and one backpropagation
                    training episode using these specified inputs and outputs is carried out. Another gridpoint, distant
                    from the fixation point, is then selected and its jet V components are provided to the perceptron,
                    along with desired outputs 0 and 1, and a second perceptron training episode is carried out using
                    these inputs and outputs. The training process then moves on to the next fixation point and
                    reference frame example. Thus, this training procedure beneficially utilizes oversampling of the examples of
                    the class of human-supplied fixation points (Hecht-Nielsen, 2004).
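                       The loop below sketches this two-episode-per-example procedure, assuming the shuffled
                    records produced by the build_training_set sketch, a regular gridpoint lattice with a known
                    pixel spacing, and a train_step callable that performs one backpropagation episode (for
                    instance, the GazePerceptron.train_step method sketched earlier); the epoch count and the
                    minimum distance used to pick the ‘‘distant’’ negative gridpoint are illustrative guesses:

                    import numpy as np

                    def train_gaze_controller(dataset, train_step, grid_dims, spacing,
                                              n_epochs=10, min_neg_dist=100.0, seed=0):
                        # grid_dims is (rows, cols) of the gridpoint lattice; spacing is
                        # the lattice pitch in pixels, so gridpoint (r, c) sits at pixel
                        # (c * spacing, r * spacing). Each record's "V" holds one
                        # 224-vector per gridpoint in row-major order.
                        rng = np.random.default_rng(seed)
                        rows, cols = grid_dims
                        for _ in range(n_epochs):
                            for ex in dataset:
                                fx, fy = ex["fixation"]
                                # Episode 1: the gridpoint nearest the human fixation
                                # point, trained toward outputs (1, 0).
                                r = min(rows - 1, max(0, int(round(fy / spacing))))
                                c = min(cols - 1, max(0, int(round(fx / spacing))))
                                train_step(ex["V"][r * cols + c], np.array([1.0, 0.0]))
                                # Episode 2: a gridpoint distant from the fixation point,
                                # trained toward outputs (0, 1).
                                while True:
                                    nr, nc = int(rng.integers(rows)), int(rng.integers(cols))
                                    if np.hypot(nc * spacing - fx, nr * spacing - fy) > min_neg_dist:
                                        break
                                train_step(ex["V"][nr * cols + nc], np.array([0.0, 1.0]))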