Page 82 - Designing Sociable Robots
The Vision System
influence provides Kismet with a primitive attention span. For Kismet, the second stage
includes an eye-detector that operates over the foveal image, and a target proximity esti-
mator that operates on the stereo images of the two central wide field of view (FoV)
cameras.
Four factors (pre-attentive processing, post-attentive processing, task-driven influences,
and habituation) influence the direction of Kismet’s gaze. This in turn determines the robot’s
subsequent perception, which ultimately feeds back to behavior. Hence, the robot is in
a continuous cycle: behavior influencing what is perceived, and perception influencing
subsequent behavior.
Bottom-up Contributions: Computing Feature Maps
The purpose of the first massively parallel stage is to identify locations that are worthy
of further attention. This is considered to be a bottom-up or stimulus-driven contribution.
Raw sensory saliency cues are equivalent to those “pop-out” effects studied by Treisman
(1986), such as color intensity, motion, and orientation for visual stimuli. As such, this stage
biases attention toward inherently distinctive items in the visual field; it will not guide
attention toward an item whose properties are not inherently salient.
This contribution is computed from a series of feature maps, which are updated in parallel
over the entire visual field (of the wide FoV camera) for a limited set of basic visual features.
There is a separate feature map for each basic feature (for Kismet these correspond to
color, motion, and skin tone), and each map is topographically organized and in retinotopic
coordinates. The computation of these maps is described below. The value of each location is
called the activation level and represents the saliency of that location in the visual field with
respect to the other locations. In this implementation, the overall bottom-up contribution
comes from combining the results of these feature maps in a weighted sum.
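The weighted-sum combination described above can be sketched as follows. This is a minimal illustration assuming NumPy; the map names and gain values are hypothetical stand-ins, not the values used on Kismet.

```python
import numpy as np

def bottom_up_saliency(feature_maps, weights):
    """Combine per-feature activation maps into one bottom-up saliency map.

    feature_maps: dict of name -> 2-D array in retinotopic coordinates,
                  where each entry is the activation level at that location.
    weights:      dict of name -> scalar gain for that feature.
    """
    stack = np.stack([weights[n] * feature_maps[n] for n in feature_maps])
    return stack.sum(axis=0)

# Hypothetical example: three 128 x 128 maps for color, motion, and skin tone.
maps = {n: np.random.rand(128, 128) for n in ("color", "motion", "skin")}
gains = {"color": 1.0, "motion": 1.5, "skin": 1.2}

saliency = bottom_up_saliency(maps, gains)
# The most active location is a candidate target of attention.
most_salient = np.unravel_index(np.argmax(saliency), saliency.shape)
```

Because the combination is a simple weighted sum, higher-level systems can retune the gains at runtime, which is how task-driven influences later modulate this bottom-up stage.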
The video signal from each of Kismet’s cameras is digitized by one of the 400 MHz
nodes with frame-grabbing hardware. The image is then subsampled and averaged to an
appropriate size. Currently, we use an image size of 128 × 128, which allows us to com-
plete all of the processing in near real-time. To minimize latency, each feature map is
computed by a separate 400 MHz processor (each of which also carries additional com-
putational load). All of the feature detectors discussed here can operate at multiple
scales.
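The subsample-and-average step can be sketched as block averaging. This is a minimal sketch assuming NumPy and a capture whose dimensions are integer multiples of the working size; the 512 × 512 input size is a hypothetical stand-in for the digitized frame.

```python
import numpy as np

def subsample_average(image, out_size=128):
    """Downsample a frame by averaging non-overlapping pixel blocks.

    Assumes height and width are integer multiples of out_size
    (e.g. a 512 x 512 capture averaged down to 128 x 128).
    """
    h, w = image.shape[:2]
    bh, bw = h // out_size, w // out_size
    trimmed = image[: bh * out_size, : bw * out_size]
    # Reshape so each bh x bw block occupies its own axes, then average them.
    blocks = trimmed.reshape(out_size, bh, out_size, bw, *image.shape[2:])
    return blocks.mean(axis=(1, 3))

frame = np.random.rand(512, 512)   # stand-in for a digitized camera frame
small = subsample_average(frame)   # 128 x 128 working image
```

Averaging (rather than simply dropping pixels) low-pass filters the frame, so the smaller working image does not alias fine detail that the feature detectors would otherwise misread.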
Color saliency feature map One of the most basic and widely recognized visual features
is color. The models of color saliency are drawn from the complementary work on visual
search and attention (Itti et al., 1998). The incoming video stream contains three 8-bit color
channels (r for red, g for green, and b for blue), each with a 0 to 255 value range, that are
transformed into four color-opponent channels (r′, g′, b′, and y′). Each input color channel

