Page 82 - Designing Sociable Robots
The Vision System





                       influence provides Kismet with a primitive attention span. For Kismet, the second stage
                       includes an eye-detector that operates over the foveal image, and a target proximity esti-
                       mator that operates on the stereo images of the two central wide field of view (FoV)
                       cameras.
                         Four factors (pre-attentive processing, post-attentive processing, task-driven influences,
                       and habituation) influence the direction of Kismet’s gaze. This in turn determines the robot’s
                       subsequent perception, which ultimately feeds back to behavior. Hence, the robot is in
                       a continuous cycle: behavior influencing what is perceived, and perception influencing
                       subsequent behavior.
                       Bottom-up Contributions: Computing Feature Maps

The purpose of the first, massively parallel stage is to identify locations that are worthy of further attention. This is considered a bottom-up, or stimulus-driven, contribution. The raw sensory saliency cues are equivalent to the "pop-out" effects studied by Treisman (1986), such as color intensity, motion, and orientation for visual stimuli. As such, this stage biases attention toward distinctive items in the visual field; it cannot guide attention toward an item whose properties are not inherently salient.
                         This contribution is computed from a series of feature maps, which are updated in parallel
                       over the entire visual field (of the wide FoV camera) for a limited set of basic visual features.
                       There is a separate feature map for each basic feature (for Kismet these correspond to
                       color, motion, and skin tone), and each map is topographically organized and in retinotopic
                       coordinates. The computation of these maps is described below. The value of each location is
                       called the activation level and represents the saliency of that location in the visual field with
                       respect to the other locations. In this implementation, the overall bottom-up contribution
                       comes from combining the results of these feature maps in a weighted sum.
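A minimal sketch of this weighted-sum combination follows. The map shapes, feature names, and gains here are illustrative; the text does not give Kismet's actual weights.

```python
import numpy as np

def combine_feature_maps(feature_maps, weights):
    """Weighted sum of per-feature saliency maps into one bottom-up map.

    feature_maps: dict of feature name -> 2-D activation array, all the
                  same shape and in retinotopic coordinates.
    weights:      dict of feature name -> scalar gain.
    """
    combined = np.zeros_like(next(iter(feature_maps.values())), dtype=float)
    for name, fmap in feature_maps.items():
        combined += weights[name] * fmap
    return combined

# Toy 4x4 activations for the three features Kismet uses; the gains
# are made up for the example, not the robot's actual values.
maps = {
    "color":  np.full((4, 4), 0.2),
    "motion": np.full((4, 4), 1.0),
    "skin":   np.full((4, 4), 0.1),
}
maps["motion"][1, 2] = 2.0          # a strongly moving region
gains = {"color": 1.0, "motion": 1.5, "skin": 0.8}

saliency = combine_feature_maps(maps, gains)
winner = np.unravel_index(np.argmax(saliency), saliency.shape)
```

The location with the highest combined activation (`winner`) is the candidate attention target.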
                         The video signal from each of Kismet’s cameras is digitized by one of the 400 MHz
                       nodes with frame-grabbing hardware. The image is then subsampled and averaged to an
appropriate size. Currently, we use an image size of 128 × 128, which allows us to complete all of the processing in near real-time. To minimize latency, each feature map is computed by a separate 400 MHz processor (each of which also carries additional computational load). All of the feature detectors discussed here can operate at multiple scales.
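The subsample-and-average step can be sketched as block averaging over non-overlapping pixel blocks. The 512 × 512 source size and factor of 4 below are assumptions for the example; the text only specifies the 128 × 128 target.

```python
import numpy as np

def subsample_average(image, factor):
    """Down-sample a grayscale frame by averaging non-overlapping
    factor x factor blocks of pixels."""
    h, w = image.shape
    h, w = h - h % factor, w - w % factor        # crop to a multiple of factor
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

frame = np.arange(512 * 512, dtype=float).reshape(512, 512)
small = subsample_average(frame, 4)              # 512 -> 128 per side
```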

Color saliency feature map  One of the most basic and widely recognized visual features is color. The models of color saliency used here are drawn from complementary work on visual search and attention (Itti et al., 1998). The incoming video stream contains three 8-bit color channels (r for red, g for green, and b for blue), each with a 0 to 255 value range, that are transformed into four color-opponent channels (r′, g′, b′, and y′). Each input color channel
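A sketch of the opponent-channel transform in the formulation of Itti et al. (1998), with negative responses clipped to zero. This is an illustration of the cited technique, not Kismet's actual code, which may differ in normalization.

```python
import numpy as np

def opponent_channels(r, g, b):
    """Itti-style color opponency: four broadly tuned channels with
    negative responses clipped to zero."""
    r, g, b = (c.astype(float) for c in (r, g, b))
    rp = r - (g + b) / 2                      # red vs. green + blue
    gp = g - (r + b) / 2                      # green vs. red + blue
    bp = b - (r + g) / 2                      # blue vs. yellow
    yp = (r + g) / 2 - np.abs(r - g) / 2 - b  # yellow vs. blue
    return tuple(np.maximum(c, 0.0) for c in (rp, gp, bp, yp))

# A saturated red pixel drives only the red-opponent channel.
rp, gp, bp, yp = opponent_channels(
    np.array([[255]]), np.array([[0]]), np.array([[0]]))
```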