Page 121 - Designing Sociable Robots






                       102                                                              Chapter 7





                         Given the motivation of being able to use natural speech as a training signal for Kismet,
                       it remains to be seen how the existing system needs to be improved or changed to serve this
                       purpose. Naturally occurring robot-directed speech doesn’t come in nicely packaged sound
                       bites. Often there is clipping, multiple prosodic contours of different types in long utterances,
                       and other background noise (doors slamming, people talking, etc.). Again, targeting infant-caregiver
                       interactions helps alleviate these issues, as infant-directed speech is slower, shorter,
                       and more exaggerated. The collection of robot-directed utterances, however, demonstrates
                       a need to address these issues carefully.
                         The recognizer in its current implementation is specific to female speakers, and it is
                       particularly tuned to women who can use motherese effectively. Granted, not all people
                       will want to use motherese to instruct robots. At this early stage of research, however, I am
                       willing to exploit naturally occurring simplifications of robot-directed speech to explore
                       human-style socially situated learning scenarios. Given the classifier’s strong performance
                       for the caregivers (those who will instruct the robot intensively), and decent performance
                       for other female speakers (especially for prohibition and approval), I am quite encouraged
                       at these early results. Future improvements include either training a male adult model, or
                       making the current model more gender-neutral.
                         For instructional purposes, the question remains: How good is good enough? For five-way
                       classifiers that recognize emotional speech, a performance of 70 to 80 percent is
                       regarded as state of the art. In practice, within an instructional setting, this may be an
                       unacceptable number of misclassifications. As a result, our approach has taken care to min-
                       imize the number of “bad” misclassifications. The social context is also exploited to reduce
                       misclassifications further (such as soothing versus neutral). Finally, expressive feedback
                       is provided to the caregivers so they can make sure that the robot properly “understood”
                       their intent. By incorporating expressive feedback, I have already observed some intriguing
                       social dynamics that arise with naive female subjects. I intend to investigate these social
                       dynamics further so that they can be used to advantage in instructional scenarios.
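
The distinction between overall performance and the number of "bad" misclassifications can be made concrete with a small sketch. Everything named here is an assumption for illustration, not the evaluation actually used for Kismet: the label "attention" for the fifth class, the choice of which confusions count as "bad" (here, confusions of prohibition with approval or with soothing), and the counts in the example table, which are invented and are not measured results.

```python
# Hypothetical class labels for a five-way affective-intent classifier.
# "attention" is an assumed name for the fifth class.
CLASSES = ("approval", "attention", "prohibition", "soothing", "neutral")

# Assumed definition of a "bad" misclassification: prohibition confused
# with approval or with soothing, in either direction. Other errors
# (e.g. soothing vs. neutral) are treated as tolerable.
BAD_PAIRS = {
    frozenset(("approval", "prohibition")),
    frozenset(("soothing", "prohibition")),
}


def summarize(confusion):
    """Return (accuracy, bad_error_rate) for confusion[true][predicted] counts."""
    total = correct = bad = 0
    for true_label, row in confusion.items():
        for predicted, count in row.items():
            total += count
            if predicted == true_label:
                correct += count
            elif frozenset((true_label, predicted)) in BAD_PAIRS:
                bad += count
    if total == 0:
        raise ValueError("confusion table contains no samples")
    return correct / total, bad / total


# Invented counts, for illustration only.
example_confusion = {
    "approval": {"approval": 8, "prohibition": 1, "neutral": 1},
    "prohibition": {"prohibition": 7, "neutral": 3},
}
accuracy, bad_rate = summarize(example_confusion)
```

Reporting the two numbers separately shows why a classifier in the 70 to 80 percent range can still be usable for instruction: what matters most is how often an error falls into the "bad" set, not the overall error count alone.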
                         To provide the human instructor with greater precision in issuing vocal feedback, one must
                       look beyond how something is said to what is said. Since the underlying speech recognition
                       system (running on the Linux machine) is speaker-independent, this will boost recognition
                       performance for both males and females. It is also a fascinating question of how the robot
                       could learn the valence and arousal associated with particular utterances by bootstrapping
                       from the correlation between those phonemic sequences that show particular persistence
                       during each of the four classes of affective intents. Over time, Kismet could associate the
                       utterance “Good robot!” with positive valence, “No, stop that!” with negative valence, “Look
                       at this!” with increased arousal, and “Oh, it’s ok,” with decreased arousal by grounding it in
                       an affective context and Kismet’s emotional system. Developmental psycholinguists posit
                       that human infants learn their first meanings through this kind of affectively-grounded social