7 The Auditory System
Human speech provides a natural and intuitive interface both for communicating with and
teaching humanoid robots. In general, the acoustic pattern of speech contains three kinds
of information: who the speaker is, what the speaker said, and how the speaker said it. This
chapter focuses on the problem of recognizing affective intent in robot-directed speech.
The work presented in this chapter was carried out in collaboration with Lijin Aryananda
(Breazeal & Aryananda, 2002).
When extracting the affective message of a speech signal, there are two related yet distinct questions one can ask. The first: “What emotion is being expressed?” In this case, the answer describes an emotional quality—such as sounding angry, or frightened, or disgusted. Each emotional state causes changes in the autonomic nervous system. This, in turn,
influences heart rate, blood pressure, respiratory rate, sub-glottal pressure, salivation, and
so forth. These physiological changes produce global adjustments to the acoustic correlates
of speech—influencing pitch, energy, timing, and articulation. A number of vocal emotion recognition systems have been developed in recent years that use various combinations of these acoustic features with different types of learning algorithms (Dellaert et al., 1996; Nakatsu et al., 1999). To give a rough sense of performance, a five-way classifier operating at approximately 80 percent accuracy is considered state of the art (at the time of this writing). This is impressive, considering that humans are far from perfect at recognizing emotion from speech alone. Some researchers have attempted to use multi-modal cues, combining facial expression with expressive speech, to improve recognition performance (Chen & Huang, 1998).
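To make these acoustic correlates concrete, the sketch below computes a handful of the global prosodic statistics named above (pitch mean, variation, and range; energy mean and variation) from a mono speech waveform. It is a minimal illustration in Python, not the recognizer described in this book; the frame sizes, the autocorrelation pitch estimator, and the crude energy-based voicing gate are all simplifying assumptions.

    import numpy as np

    def frame_signal(x, frame_len, hop):
        # Slice a 1-D waveform into overlapping frames.
        n = 1 + (len(x) - frame_len) // hop
        return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

    def pitch_autocorr(frame, sr, fmin=75.0, fmax=500.0):
        # Estimate F0 of one frame from the autocorrelation peak within
        # a plausible speaking range (the fmin..fmax bounds are assumptions).
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / fmax), min(int(sr / fmin), len(ac) - 1)
        return sr / (lo + np.argmax(ac[lo:hi]))

    def prosodic_features(x, sr, frame_ms=25, hop_ms=10):
        # Global pitch and energy statistics of the kind used as acoustic
        # correlates of vocal emotion. Assumes a voiced utterance.
        frames = frame_signal(x, int(sr * frame_ms / 1000), int(sr * hop_ms / 1000))
        energy = np.sqrt((frames ** 2).mean(axis=1))   # per-frame RMS energy
        voiced = frames[energy > 0.2 * energy.max()]   # crude voicing gate
        f0 = np.array([pitch_autocorr(f, sr) for f in voiced])
        return np.array([f0.mean(), f0.std(), f0.max() - f0.min(),
                         energy.mean(), energy.std()])

A feature vector of this sort is what the classifiers cited above would consume; the particular statistics and thresholds here are illustrative only.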
7.1 Recognizing Affect in Human Speech
For the purposes of training a robot, however, the raw emotional content of the speaker’s
voice is only part of the message. This leads us to the second, related question: What is the
affective intent of the message? The answer might be that the speaker was praising, prohibiting, or alerting the recipient of the message. A few researchers have developed
systems that can recognize speaker approval versus speaker disapproval from child-directed
speech (Roy & Pentland, 1996), or recognize praise, prohibition, and attentional bids from
infant-directed speech (Slaney & McRoberts, 1998). For the remainder of this chapter, I
discuss how this idea could be extended to serve as a useful training signal for Kismet. Note
that Kismet does not learn from humans yet, but this is an important capability that could
support socially situated learning.
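To suggest how such category labels could become a training signal, the sketch below pairs the prosodic feature vector from the earlier sketch with a nearest-prototype classifier over hypothetical intent classes such as approval, prohibition, and attention. This is a deliberately simple stand-in, not Kismet's actual recognizer, and the class labels and training data are assumptions.

    import numpy as np

    def fit_prototypes(features, labels):
        # One mean (prototype) feature vector per affective intent class.
        # Features are z-scored so pitch and energy are on comparable scales.
        feats = np.asarray(features, dtype=float)
        labs = np.asarray(labels)
        mu, sd = feats.mean(axis=0), feats.std(axis=0) + 1e-9
        protos = {c: ((feats[labs == c] - mu) / sd).mean(axis=0)
                  for c in set(labels)}
        return protos, mu, sd

    def classify_intent(feat, protos, mu, sd):
        # Label an utterance with the intent class of the nearest prototype.
        z = (np.asarray(feat, dtype=float) - mu) / sd
        return min(protos, key=lambda c: np.linalg.norm(z - protos[c]))

    # Hypothetical usage with labeled robot-directed utterances:
    #   protos, mu, sd = fit_prototypes(train_feats, train_labels)
    #   intent = classify_intent(prosodic_features(x, 16000), protos, mu, sd)
    #   # intent is e.g. "approval", "prohibition", or "attention"

A nearest-prototype rule is far weaker than the classifiers cited earlier; it is shown only to make concrete where a prosodic feature vector would meet class labels during training.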
Developmental psycholinguists have extensively studied how affective intent is communicated to preverbal infants (Fernald, 1989; Grieser & Kuhl, 1988). Infant-directed speech is
typically quite exaggerated in pitch and intensity (Snow, 1972). From the results of a series