Page 174 - Designing Sociable Robots

breazeal-79017  book  March 18, 2002  14:7





                       The Behavior System                                                  155





                       entrain to the robot by reading its turn-taking cues. The resulting interaction dynamics are
                       reminiscent of infant-caregiver exchanges. However, there are a number of ways in which the
                       system could be improved.
                         The robot does not currently have the ability to interrupt itself. This will be an important
                       ability for more sophisticated exchanges. Video of people talking with Kismet shows that
                       they are quite resilient to hiccups in the flow of “conversation.” If they begin to say
                       something just before the robot, they will immediately pause once the robot starts speaking
                       and wait for the robot to finish. It would be nice if Kismet could exhibit the same courtesy.
                       The robot’s babbles are quite short at the moment, so this is not a serious issue yet. As the
                       utterances become longer, it will become more important.
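One way to picture the missing self-interruption ability is a babble loop that checks, before each syllable, whether the person has started speaking. The following is only a minimal sketch under stated assumptions: `speak_with_yield` and the `human_in_speech` callback are hypothetical names introduced for illustration, and the per-syllable granularity is a simplification rather than a description of Kismet's vocalization system.

```python
def speak_with_yield(syllables, human_in_speech):
    """Utter syllables in order, yielding the floor if the person starts talking.

    syllables: list of strings to utter (a stand-in for a babble).
    human_in_speech: hypothetical callable mapping a step index to True
        when the person is speaking at that step.
    Returns (spoken_syllables, interrupted).
    """
    spoken = []
    for step, syllable in enumerate(syllables):
        # Check for overlapping speech before committing to the next syllable.
        if human_in_speech(step):
            return spoken, True
        spoken.append(syllable)
    return spoken, False
```

With short babbles the loop rarely has a chance to yield, which matches the observation that the issue grows with utterance length.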
                         It is also important for the robot to understand where the human’s attention is directed. At
                       the very least, the robot should have a robust way of measuring when a person is addressing
                       it. Currently the robot assumes that if a person is nearby, then that person is attending to
                       the robot. The robot also assumes that it is the most salient person who is addressing it.
                       Clearly this is not always the case. This is painfully evident when two people try to talk to
                       the robot and to each other. It would be a tremendous improvement to the current imple-
                       mentation if the robot would only respond when a person addressed it directly (instead of
                       addressing someone else) and if the robot responded to the correct person (instead of the
                       most salient person). Sound localization using the stereo microphones on the ears could
                       help identify the source of the speech signal. This information could also be correlated with
                       visual input to direct the robot’s gaze. In general, determining where a person is looking is
                       a computationally difficult problem (Newman & Zelinsky, 1998; Scassellati, 1999).
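The suggestion of localizing speech with two microphones and correlating it with visual input can be sketched as follows. This is an illustrative sketch, not the robot's implementation: it assumes a far-field source, an interaural time difference defined as left-arrival time minus right-arrival time (so positive values place the source to the right), and a hypothetical 15-degree matching tolerance. The function names and the microphone spacing used in any call are assumptions as well. Note that it only estimates who is speaking, not whether that person is addressing the robot.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C


def azimuth_from_itd(itd_s, mic_spacing_m):
    """Far-field bearing in degrees from an interaural time difference.

    itd_s: arrival time at the left mic minus arrival time at the right mic
        (assumed convention), in seconds.
    mic_spacing_m: distance between the two microphones, in meters.
    Returns 0 for straight ahead and positive values for sources to the right.
    """
    ratio = SPEED_OF_SOUND * itd_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against measurement noise
    return math.degrees(math.asin(ratio))


def pick_speaker(sound_azimuth_deg, face_azimuths_deg, tolerance_deg=15.0):
    """Return the index of the face nearest the sound bearing, or None."""
    if not face_azimuths_deg:
        return None
    best = min(
        range(len(face_azimuths_deg)),
        key=lambda i: abs(face_azimuths_deg[i] - sound_azimuth_deg),
    )
    if abs(face_azimuths_deg[best] - sound_azimuth_deg) > tolerance_deg:
        return None  # no visible face is plausibly the sound source
    return best
```

Returning `None` when no face lies near the sound bearing lets a caller fall back to the current behavior (attending to the most salient person) instead of guessing.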
                         The latency in Kismet’s verbal turn-taking behavior needs to be reduced. For humans,
                       the average time for a verbal reply is about 250 ms. Kismet’s verbal response time
                       varies from 500 ms to 1500 ms. Much of this variation depends on the length of the person’s
                       previous utterance and on the time it takes the robot to shift between turn-taking postures. In the current
                       implementation, the in-speech flag is set when the person begins speaking, and is cleared
                       when the person finishes. There is a delay of about 500 ms built into the speech recognition
                       system from the end of speech to accommodate pauses between phrases. Additional delays
                       are related to the length of the spoken utterance: the longer the utterance, the more
                       computation is required before the output is produced. To alleviate awkward pauses and to give
                       people immediate feedback that the robot heard them, the ear-perk response is triggered by
                       the sound-flag. This flag is sent immediately whenever the speech recognizer receives
                       input (speech or non-speech sounds). Delays are also introduced as the robot shifts posture
                       between taking its turn and relinquishing the floor. This posturing also sends important social cues
                       and enlivens the exchange. In the video, the turn-taking pace is certainly slower
                       than for conversing adults, but given the lively posturing and facial animation, it appears en-
                       gaging. The naive subjects readily adapted to this pace and did not seem to find it awkward.
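The interplay between the two flags described above can be illustrated with a small frame-based simulation. This is a sketch under stated assumptions, not the recognizer's actual logic: audio is assumed to arrive as 100 ms frames already labeled `"speech"`, `"noise"`, or `"silence"`, the function name `trace_flags` is hypothetical, and the only figure taken from the text is the roughly 500 ms end-of-speech delay.

```python
def trace_flags(frames, frame_ms=100, hangover_ms=500):
    """Trace the sound flag and in-speech flag over labeled audio frames.

    frames: list of "speech", "noise", or "silence" labels (assumed input).
    Returns a list of (frame_start_ms, sound_flag, in_speech) tuples.
    """
    in_speech = False
    speech_end_ms = 0
    trace = []
    for i, kind in enumerate(frames):
        t_ms = i * frame_ms
        # The sound flag follows any input, speech or not, with no delay.
        sound_flag = kind != "silence"
        if kind == "speech":
            in_speech = True
            speech_end_ms = t_ms + frame_ms  # end of the latest speech frame
        elif in_speech and t_ms - speech_end_ms >= hangover_ms:
            # Clear only after the built-in pause allowance has elapsed.
            in_speech = False
        trace.append((t_ms, sound_flag, in_speech))
    return trace
```

For example, with one noise frame, two speech frames, and then silence, the sound flag is already raised at 0 ms (enough to trigger an ear perk), while the in-speech flag stays raised until 800 ms, which is 500 ms after the speech ends at 300 ms.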