7 The Auditory System



Human speech provides a natural and intuitive interface both for communicating with and
for teaching humanoid robots. In general, the acoustic pattern of speech contains three kinds
of information: who the speaker is, what the speaker said, and how the speaker said it. This
chapter focuses on the problem of recognizing affective intent in robot-directed speech.
The work presented in this chapter was carried out in collaboration with Lijin Aryananda
(Breazeal & Aryananda, 2002).

When extracting the affective message of a speech signal, there are two related yet distinct
questions one can ask. The first is: “What emotion is being expressed?” In this case, the
answer describes an emotional quality, such as sounding angry, frightened, or disgusted.
Each emotional state causes changes in the autonomic nervous system, which in turn
influences heart rate, blood pressure, respiratory rate, sub-glottal pressure, salivation, and
so forth. These physiological changes produce global adjustments to the acoustic correlates
of speech, influencing pitch, energy, timing, and articulation. A number of vocal emotion
recognition systems developed in the past few years use different variations and combinations
of these acoustic features with different types of learning algorithms (Dellaert et al., 1996;
Nakatsu et al., 1999). To give a rough sense of performance, a five-way classifier with
approximately 80 percent accuracy is considered state of the art (at the time of this writing).
This is impressive considering that humans are far from perfect at recognizing emotion from
speech alone. Some researchers have attempted to use multi-modal cues (facial expression
together with expressive speech) to improve recognition performance (Chen & Huang, 1998).
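
To make this concrete, the sketch below shows the general recipe such systems follow:
summarize the pitch and energy contours of an utterance into a handful of global statistics
and hand them to a supervised classifier. It is only a minimal illustration, assuming the
contours are already available from some external pitch tracker; the particular feature set
and the k-nearest-neighbor classifier are illustrative choices, not the methods used by the
systems cited above.

    # Minimal sketch (not the systems cited above): classify the emotional
    # quality of an utterance from global prosodic statistics. The pitch (F0)
    # and energy contours are assumed to come from an external pitch tracker;
    # the feature set and the k-NN classifier are illustrative choices.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def prosodic_features(f0, energy):
        """Summarize pitch, energy, and timing into a single feature vector.

        f0     : per-frame pitch estimates in Hz (0 for unvoiced frames)
        energy : per-frame energy values
        """
        voiced = f0[f0 > 0]
        return np.array([
            voiced.mean(),                      # average pitch
            voiced.std(),                       # pitch variability
            voiced.max() - voiced.min(),        # pitch range
            np.mean(np.abs(np.diff(voiced))),   # jaggedness of the pitch contour
            energy.mean(),                      # overall loudness
            energy.std(),                       # loudness variability
            (f0 > 0).mean(),                    # voiced fraction, a crude timing cue
        ])

    def train_emotion_classifier(utterances, labels):
        """utterances: list of (f0, energy) arrays; labels: e.g., "angry", "sad"."""
        X = np.vstack([prosodic_features(f0, en) for f0, en in utterances])
        clf = KNeighborsClassifier(n_neighbors=5)
        clf.fit(X, labels)
        return clf

A real system would use a richer feature set (articulation and timing measures, for example)
and a stronger learner, but the pipeline of contour extraction, global statistics, and
supervised classification is the common pattern across this body of work.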

7.1 Recognizing Affect in Human Speech

For the purposes of training a robot, however, the raw emotional content of the speaker’s
voice is only part of the message. This leads us to the second, related question: “What is
the affective intent of the message?” Answers to this question may be that the speaker was
praising, prohibiting, or alerting the recipient of the message. A few researchers have
developed systems that can recognize speaker approval versus speaker disapproval from
child-directed speech (Roy & Pentland, 1996), or recognize praise, prohibition, and
attentional bids from infant-directed speech (Slaney & McRoberts, 1998). For the remainder
of this chapter, I discuss how this idea could be extended to serve as a useful training
signal for Kismet. Note that Kismet does not learn from humans yet, but this is an important
capability that could support socially situated learning.
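
Although Kismet does not yet learn this way, the hypothetical sketch below illustrates how a
recognized affective intent could be turned into a scalar training signal for a socially
situated learner. The class names follow the praise/prohibition/attentional-bid distinction
above; the numeric values and the training_signal helper are purely illustrative assumptions,
not part of Kismet's implementation.

    # Hypothetical sketch: map a recognized affective intent to a scalar
    # training signal. The reward values are arbitrary illustrative choices.
    INTENT_TO_REWARD = {
        "praise": +1.0,           # approval reinforces the preceding behavior
        "prohibition": -1.0,      # disapproval punishes it
        "attentional bid": 0.0,   # redirects attention rather than rewarding
        "neutral": 0.0,
    }

    def training_signal(intent_label, confidence):
        """Scale the reward by the recognizer's confidence in its label."""
        return INTENT_TO_REWARD.get(intent_label, 0.0) * confidence
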
Developmental psycholinguists have extensively studied how affective intent is communicated
to preverbal infants (Fernald, 1989; Grieser & Kuhl, 1988). Infant-directed speech is
typically quite exaggerated in pitch and intensity (Snow, 1972). From the results of a series









