                       as well as supplement the information transmitted through the verbal channel. For Kismet,
                       the information communicated to the human is grounded in affect. The facial displays are
                       used to help regulate the dynamics of the exchange. (Video demonstrations of Kismet’s
                       expressive displays and the accompanying vocalizations are included on the CD-ROM in
                       the second section, “Readable Expressions.”)

                       11.1  Emotion in Human Speech

                       There has been an increasing amount of work in identifying those acoustic features that
                       vary with the speaker’s affective state (Murray & Arnott, 1993). Changes in the speaker’s
                       autonomic nervous system can account for some of the most significant changes, where the
                       sympathetic and parasympathetic subsystems regulate arousal in opposition. For instance,
                       when a subject is in a state of fear, anger, or joy, the sympathetic nervous system is aroused.
                       This induces an increased heart rate, higher blood pressure, changes in depth of respiratory
                       movements, greater sub-glottal pressure, dryness of the mouth, and occasional muscle
                       tremor. The resulting speech is faster, louder, and more precisely enunciated with strong
                       high-frequency energy, a higher average pitch, and wider pitch range. In contrast, when
                       a subject is tired, bored, or sad, the parasympathetic nervous system is more active. This
                       causes a decreased heart rate, lower blood pressure, and increased salivation. The resulting
                       speech is typically slower, lower-pitched, more slurred, and with little
                       high-frequency energy.
                       Picard (1997) presents a nice overview of work in this area.
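                         The contrast between these two arousal profiles can be made concrete
                       with a small sketch. The feature names and the signed arousal value used
                       below are illustrative assumptions rather than anything drawn from the
                       studies cited above; the code simply records the qualitative tendencies
                       just described.

from dataclasses import dataclass

@dataclass
class AcousticTendencies:
    speech_rate: str    # "faster" or "slower"
    loudness: str       # "louder" or "softer"
    mean_pitch: str     # "higher" or "lower"
    pitch_range: str    # "wider" or "narrower"
    hf_energy: str      # "strong" or "weak" high-frequency energy
    articulation: str   # "precise" or "slurred"

# High sympathetic arousal (e.g., fear, anger, joy): faster, louder speech,
# higher average pitch, wider pitch range, strong high-frequency energy,
# more precise enunciation.
HIGH_AROUSAL = AcousticTendencies("faster", "louder", "higher", "wider",
                                  "strong", "precise")

# Parasympathetic dominance (e.g., tiredness, boredom, sadness): slower,
# softer, lower-pitched, narrower pitch range, weak high-frequency energy,
# more slurred articulation.
LOW_AROUSAL = AcousticTendencies("slower", "softer", "lower", "narrower",
                                 "weak", "slurred")

def tendencies_for(arousal: float) -> AcousticTendencies:
    """Map signed arousal (positive = sympathetic dominance) to a profile."""
    return HIGH_AROUSAL if arousal > 0.0 else LOW_AROUSAL
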
                         Table 11.1 summarizes how emotion in speech tends to alter the pitch, timing,
                       voice quality, and articulation of the speech signal. Several of these features, however, are
                       also modulated by the prosodic effects that the speaker uses to communicate grammatical
                       structure and lexical correlates. These tend to have a more localized influence on the speech
                       signal, such as emphasizing a particular word. For recognition tasks, this increases the
                       challenge of isolating those feature characteristics modulated by emotion. Even humans are
                       not perfect at perceiving the intended emotion for those emotional states that have similar
                       acoustic characteristics. For instance, surprise can be perceived or understood as either
                       joyous surprise (i.e., happiness) or apprehensive surprise (i.e., fear). Disgust is a form of
                       disapproval and can be confused with anger.
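                         A minimal sketch shows why such confusions arise. The prototype values
                       below are invented for illustration (they are not taken from Table 11.1
                       or from any corpus); the point is only that high-arousal emotions occupy
                       neighboring regions of a simple acoustic feature space, so a
                       nearest-prototype classifier makes the same kinds of errors that
                       listeners do.

import math

# Invented, normalized prototypes over (mean pitch, pitch range, tempo, loudness).
PROTOTYPES = {
    "joy":     (0.80, 0.90, 0.70, 0.80),
    "fear":    (0.90, 0.80, 0.80, 0.70),   # near joy: both are high-arousal states
    "anger":   (0.70, 0.85, 0.80, 0.95),
    "sadness": (0.20, 0.20, 0.20, 0.20),
}

def classify(features):
    """Return the label of the nearest prototype (Euclidean distance)."""
    return min(PROTOTYPES, key=lambda e: math.dist(features, PROTOTYPES[e]))

# A "surprise"-like utterance sits between the joy and fear prototypes; nudging
# any feature slightly flips the label between the two, mirroring the
# joyous/apprehensive ambiguity described above.
print(classify((0.86, 0.84, 0.76, 0.74)))   # -> "fear" here; "joy" is nearly as close
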
                         There have been a few systems developed to synthesize emotional speech. The Affect Edi-
                       tor by Janet Cahn is among the earliest work in this area (Cahn, 1990). Her system was based
                       on DECtalk3, a commercially available text-to-speech synthesizer. Given an English
                       sentence and an emotional quality (one of anger, disgust, fear, joy, sorrow, or surprise), she
                       developed a methodology for mapping the emotional correlates of speech (changes in pitch,
                       timing, voice quality, and articulation) onto the underlying DECtalk synthesizer settings.
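                         The sketch below conveys the flavor of such a mapping, though only in
                       outline: the setting names and numeric offsets are hypothetical
                       stand-ins, not DECtalk's actual control parameters or Cahn's published
                       values.

# Neutral baseline settings (hypothetical units for each parameter).
BASELINE = {
    "pitch_baseline": 120,   # average pitch
    "pitch_range": 30,       # excursion around the baseline
    "speech_rate": 180,      # tempo
    "loudness": 60,          # overall intensity
    "breathiness": 0,        # voice quality
    "precision": 0,          # articulatory precision
}

# Per-emotion offsets, following the qualitative trends discussed in the text.
EMOTION_OFFSETS = {
    "anger":  {"pitch_baseline": +10, "pitch_range": +20, "speech_rate": +30,
               "loudness": +15, "precision": +2},
    "joy":    {"pitch_baseline": +20, "pitch_range": +25, "speech_rate": +20,
               "loudness": +5},
    "fear":   {"pitch_baseline": +25, "pitch_range": +10, "speech_rate": +40,
               "precision": +1},
    "sorrow": {"pitch_baseline": -15, "pitch_range": -15, "speech_rate": -40,
               "loudness": -10, "breathiness": +2, "precision": -2},
}

def settings_for(emotion: str) -> dict:
    """Combine the neutral baseline with one emotion's offsets."""
    settings = dict(BASELINE)
    for name, delta in EMOTION_OFFSETS.get(emotion, {}).items():
        settings[name] += delta
    return settings

# settings_for("sorrow") -> slower, lower-pitched, softer, breathier speech;
# the resulting values would then be handed to the underlying synthesizer.
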