Page 174 - Designing Sociable Robots
entrain to the robot by reading its turn-taking cues. The resulting interaction dynamics are
reminiscent of infant-caregiver exchanges. However, there are a number of ways in which
the system could be improved.
The robot does not currently have the ability to interrupt itself. This will be an important
ability for more sophisticated exchanges. Video of people talking with Kismet shows that
they are quite resilient to hiccups in the flow of “conversation.” If they begin to say
something just before the robot, they immediately pause once the robot starts speaking
and wait for the robot to finish. It would be nice if Kismet could exhibit the same courtesy.
The robot’s babbles are quite short at the moment, so this is not a serious issue yet. As the
utterances become longer, it will become more important.
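The yielding behavior described above can be sketched as a simple policy: if the human begins speaking while the robot holds the floor, the robot pauses its own utterance and resumes once the floor is free. This is a minimal illustration of the idea, not Kismet's implementation; all names (`TurnTaker`, `on_human_speech_start`, etc.) are hypothetical.

```python
# Hypothetical sketch of a self-interruption policy for verbal turn-taking.
# Kismet does not currently implement this; the class and method names
# here are illustrative only.

class TurnTaker:
    def __init__(self):
        self.robot_speaking = False  # robot currently holds the floor
        self.paused = False          # robot yielded mid-utterance

    def on_human_speech_start(self):
        # If the human begins speaking while the robot is talking,
        # yield the floor: pause the robot's utterance immediately.
        if self.robot_speaking:
            self.paused = True
            self.robot_speaking = False

    def on_human_speech_end(self):
        # Once the human finishes, resume the interrupted utterance.
        if self.paused:
            self.paused = False
            self.robot_speaking = True
```

This mirrors the courtesy people extend to the robot: pause on overlap, resume when the other party finishes.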
It is also important for the robot to understand where the human’s attention is directed. At
the very least, the robot should have a robust way of measuring when a person is addressing
it. Currently the robot assumes that if a person is nearby, then that person is attending to
the robot. The robot also assumes that it is the most salient person who is addressing it.
Clearly this is not always the case. This is painfully evident when two people try to talk to
the robot and to each other. It would be a tremendous improvement to the current
implementation if the robot responded only when a person addressed it directly (instead of
addressing someone else), and if it responded to the correct person (instead of simply the
most salient one). Sound localization using the stereo microphones on the ears could
help identify the source of the speech signal. This information could also be correlated with
visual input to direct the robot’s gaze. In general, determining where a person is looking is
a computationally difficult problem (Newman & Zelinsky, 1998; Scassellati, 1999).
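A standard way to localize a speaker from two microphones is the interaural time difference (ITD): cross-correlate the left and right channels to find the delay between them, then convert that delay to an angle using far-field geometry. The sketch below illustrates this general technique; the 15 cm microphone spacing, function name, and far-field assumption are illustrative, not details of Kismet's hardware or software.

```python
import numpy as np

def estimate_azimuth(left, right, fs, mic_distance=0.15, c=343.0):
    """Estimate source azimuth (radians, positive = toward the left mic)
    from stereo signals via the interaural time difference.

    mic_distance (m) and the far-field model are illustrative assumptions.
    """
    n = len(right)
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (n - 1)  # negative lag: right trails left
    itd = -lag / fs                       # positive: sound hit left mic first
    # Far-field geometry: itd = mic_distance * sin(azimuth) / c
    s = np.clip(itd * c / mic_distance, -1.0, 1.0)
    return float(np.arcsin(s))
```

Correlating the estimated azimuth with visual input, as suggested above, would then let the robot direct its gaze toward the speaker.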
The latency in Kismet’s verbal turn-taking behavior needs to be reduced. For humans,
the average time for a verbal reply is about 250 ms; Kismet’s varies from 500 ms to
1500 ms. Much of this latency depends on the length of the person’s previous
utterance, and the time it takes the robot to shift between turn-taking postures. In the current
implementation, the in-speech flag is set when the person begins speaking, and is cleared
when the person finishes. There is a delay of about 500 ms built into the speech recognition
system from the end of speech to accommodate pauses between phrases. Additional delays
are related to the length of the spoken utterance: the longer the utterance, the more
computation is required before the output is produced. To alleviate awkward pauses and to give
people immediate feedback that the robot heard them, the ear-perk response is triggered by
the sound-flag. This flag is sent immediately whenever the speech recognizer receives
input (speech or non-speech sounds). Delays are also introduced as the robot shifts posture
between taking its turn and relinquishing the floor. This also sends important social cues
and enlivens the exchange. In the video, the turn-taking pace is certainly slower than
that of conversing adults, but given the lively posturing and facial animation, it appears
engaging. The naive subjects readily adapted to this pace and did not seem to find it awkward.
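The flag timing described above can be summarized in a short sketch: the sound flag fires immediately on any acoustic input (triggering the ear-perk), while the in-speech flag clears only after roughly 500 ms of silence, so that pauses between phrases are not mistaken for the end of a turn. The class and method names here are illustrative, not Kismet's actual code.

```python
# Illustrative sketch of the speech-flag timing described in the text.
# Names and structure are hypothetical; only the ~500 ms end-of-speech
# hold comes from the description of Kismet's speech recognition system.

END_OF_SPEECH_HOLD = 0.5  # seconds of silence before the turn is considered over

class SpeechFlags:
    def __init__(self):
        self.in_speech = False   # person currently holds the floor
        self.ear_perk = False    # immediate feedback that sound was heard
        self._last_sound = None  # timestamp of most recent acoustic input

    def on_sound(self, t):
        # Any input (speech or non-speech) perks the ears immediately,
        # giving the speaker instant feedback that the robot heard them.
        self.ear_perk = True
        self.in_speech = True
        self._last_sound = t

    def tick(self, t):
        # Clear the flags only after the hold interval elapses with no
        # new sound, tolerating brief pauses between phrases.
        if self.in_speech and t - self._last_sound >= END_OF_SPEECH_HOLD:
            self.in_speech = False
            self.ear_perk = False
```

Under this scheme a 300 ms pause mid-utterance leaves the in-speech flag set, while 600 ms of silence clears it and frees the robot to take its turn.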

