Given the motivation of being able to use natural speech as a training signal for Kismet,
it remains to be seen how the existing system needs to be improved or changed to serve this
purpose. Naturally occurring robot-directed speech doesn’t come in nicely packaged sound
bites. Often there is clipping, multiple prosodic contours of different types in long utterances,
and other background noise (doors slamming, people talking, etc.). Again, targeting infant-
caregiver interactions helps alleviate these issues, as infant-directed speech is slower, shorter,
and more exaggerated. The collection of robot-directed utterances, however, demonstrates
a need to address these issues carefully.
The recognizer in its current implementation is specific to female speakers, and it is
particularly tuned to women who can use motherese effectively. Granted, not all people
will want to use motherese to instruct robots. At this early stage of research, however, I am
willing to exploit naturally occurring simplifications of robot-directed speech to explore
human-style socially situated learning scenarios. Given the classifier’s strong performance
for the caregivers (those who will instruct the robot intensively), and decent performance
for other female speakers (especially for prohibition and approval), I am quite encouraged
at these early results. Future improvements include either training a male adult model, or
making the current model more gender-neutral.
For instructional purposes, the question remains: How good is good enough? A performance
of 70 to 80 percent for five-way classification of emotional speech is
regarded as state of the art. In practice, within an instructional setting, this may be an
unacceptable number of misclassifications. As a result, our approach has taken care to min-
imize the number of “bad” misclassifications. The social context is also exploited to reduce
misclassifications further (such as soothing versus neutral). Finally, expressive feedback
is provided to the caregivers so they can make sure that the robot properly “understood”
their intent. By incorporating expressive feedback, I have already observed some intriguing
social dynamics that arise with naive female subjects. I intend to investigate these social
dynamics further so that they can be used to advantage in instructional scenarios.
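To make the role of social context concrete, the sketch below shows one way a five-way affective-intent classifier could be biased so that expensive confusions (such as approval versus prohibition) are strongly penalized while cheap ones (soothing versus neutral) are tolerated. The class names follow the text; the context weights, cost values, and the classify function are illustrative assumptions, not Kismet's actual implementation.

```python
# Hypothetical sketch: biasing a five-way affective-intent classifier with
# social context and an asymmetric misclassification cost. The class names
# come from the text; priors, costs, and function names are assumptions.

INTENTS = ["approval", "prohibition", "attention", "soothing", "neutral"]

# Cost of reporting `guess` when the speaker actually meant `true`.
# Confusing soothing with neutral is cheap; confusing approval with
# prohibition (a "bad" misclassification) is expensive.
COST = {
    ("approval", "prohibition"): 10.0,
    ("prohibition", "approval"): 10.0,
    ("soothing", "neutral"): 1.0,
    ("neutral", "soothing"): 1.0,
}

def classify(posteriors, context_prior):
    """Pick the intent with the lowest expected misclassification cost.

    posteriors    -- dict intent -> probability from the prosodic classifier
    context_prior -- dict intent -> weight reflecting the social context
    """
    # Re-weight the acoustic evidence by the social context and renormalize.
    weighted = {i: posteriors[i] * context_prior.get(i, 1.0) for i in INTENTS}
    total = sum(weighted.values())
    weighted = {i: p / total for i, p in weighted.items()}

    def expected_cost(guess):
        return sum(
            weighted[true] * COST.get((true, guess), 0.0 if true == guess else 2.0)
            for true in INTENTS
        )

    return min(INTENTS, key=expected_cost)

# Example: an ambiguous utterance heard just after the robot displayed a
# distressed expression, so the context favors soothing.
posteriors = {"approval": 0.30, "prohibition": 0.05, "attention": 0.10,
              "soothing": 0.25, "neutral": 0.30}
print(classify(posteriors, {"soothing": 2.0}))   # -> "soothing"
```

In this sketch the context weight tips an otherwise ambiguous utterance away from neutral, while the asymmetric costs make the classifier reluctant to report prohibition whenever approval remains plausible, mirroring the emphasis above on avoiding the "bad" misclassifications.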
To provide the human instructor with greater precision in issuing vocal feedback, one must
look beyond how something is said to what is said. Since the underlying speech recognition
system (running on the Linux machine) is speaker-independent, recognizing the words themselves
would boost performance for both male and female speakers. It is also a fascinating question how the robot
could learn the valence and arousal associated with particular utterances by bootstrapping
from the correlation between those phonemic sequences that show particular persistence
during each of the four classes of affective intents. Over time, Kismet could associate the
utterance “Good robot!” with positive valence, “No, stop that!” with negative valence, “Look
at this!” with increased arousal, and “Oh, it’s ok,” with decreased arousal by grounding them in
an affective context and Kismet’s emotional system. Developmental psycholinguists posit
that human infants learn their first meanings through this kind of affectively-grounded social
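A minimal sketch of this bootstrapping idea is given below, assuming the system can pair a recognized phonemic sequence with the affective-intent class of the utterance that carried it. The intent-to-affect mapping mirrors the examples above; the AffectiveLexicon class, its data structures, and the persistence threshold are hypothetical rather than part of Kismet's implementation.

```python
# Hypothetical sketch of the bootstrapping idea: accumulate valence and
# arousal tags for recurring phonemic sequences, keyed by the affective-intent
# class that accompanied them.

from collections import defaultdict

# Affective tag contributed by each recognized intent class (valence, arousal),
# following the examples in the text.
INTENT_TAG = {
    "approval":    (+1.0,  0.0),   # "Good robot!"    -> positive valence
    "prohibition": (-1.0,  0.0),   # "No, stop that!" -> negative valence
    "attention":   ( 0.0, +1.0),   # "Look at this!"  -> increased arousal
    "soothing":    ( 0.0, -1.0),   # "Oh, it's ok."   -> decreased arousal
}

class AffectiveLexicon:
    """Running (valence, arousal) estimate per phonemic sequence."""

    def __init__(self):
        self.sums = defaultdict(lambda: [0.0, 0.0])
        self.counts = defaultdict(int)

    def observe(self, phoneme_seq, intent):
        """Credit the sequence with the affect of the co-occurring intent."""
        if intent not in INTENT_TAG:
            return                      # neutral utterances teach nothing here
        valence, arousal = INTENT_TAG[intent]
        key = tuple(phoneme_seq)
        self.sums[key][0] += valence
        self.sums[key][1] += arousal
        self.counts[key] += 1

    def affect(self, phoneme_seq, min_count=3):
        """Return the learned (valence, arousal), or None if seen too rarely."""
        key = tuple(phoneme_seq)
        n = self.counts[key]
        if n < min_count:
            return None
        return (self.sums[key][0] / n, self.sums[key][1] / n)
```

After enough co-occurrences, the phonemic form of “Good robot!” would return a positive valence and near-zero arousal, so the utterance itself acquires an affective meaning grounded in the context supplied by Kismet’s emotional system.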

