as well as supplement the information transmitted through the verbal channel. For Kismet,
the information communicated to the human is grounded in affect. The facial displays are
used to help regulate the dynamics of the exchange. (Video demonstrations of Kismet’s
expressive displays and the accompanying vocalizations are included on the CD-ROM in
the second section, “Readable Expressions.”)
11.1 Emotion in Human Speech
There has been an increasing amount of work in identifying those acoustic features that
vary with the speaker’s affective state (Murray & Arnott, 1993). Changes in the speaker’s
autonomic nervous system can account for some of the most significant of these effects;
the sympathetic and parasympathetic subsystems regulate arousal in opposition. For instance,
when a subject is in a state of fear, anger, or joy, the sympathetic nervous system is aroused.
This induces an increased heart rate, higher blood pressure, changes in depth of respiratory
movements, greater sub-glottal pressure, dryness of the mouth, and occasional muscle
tremor. The resulting speech is faster, louder, and more precisely enunciated with strong
high-frequency energy, a higher average pitch, and wider pitch range. In contrast, when
a subject is tired, bored, or sad, the parasympathetic nervous system is more active. This
causes a decreased heart rate, lower blood pressure, and increased salivation. The resulting
speech is typically slower, lower-pitched, more slurred, and with little high-frequency energy.
Picard (1997) presents a nice overview of work in this area.
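To make the contrast concrete, the following Python sketch encodes these qualitative tendencies as a lookup from emotional state to speech-feature profile. The state labels, feature names, and descriptions are illustrative assumptions that simply restate the prose above; they are not drawn from Kismet's implementation or from any measured data.

    # Qualitative acoustic tendencies of sympathetic arousal (fear, anger, joy)
    # versus parasympathetic dominance (tiredness, boredom, sadness).
    # All names and values below are illustrative, not empirical.
    HIGH_AROUSAL = {"fear", "anger", "joy"}            # sympathetic
    LOW_AROUSAL = {"tiredness", "boredom", "sadness"}  # parasympathetic

    def acoustic_tendencies(emotion: str) -> dict:
        """Return the qualitative speech-feature profile for an emotion."""
        if emotion in HIGH_AROUSAL:
            return {"speech_rate": "faster", "loudness": "louder",
                    "enunciation": "more precise", "high_freq_energy": "strong",
                    "mean_pitch": "higher", "pitch_range": "wider"}
        if emotion in LOW_AROUSAL:
            return {"speech_rate": "slower", "loudness": "softer",
                    "enunciation": "more slurred", "high_freq_energy": "little",
                    "mean_pitch": "lower", "pitch_range": "narrower"}
        raise ValueError(f"no profile defined for {emotion!r}")

    print(acoustic_tendencies("fear")["mean_pitch"])   # -> "higher"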
Table 11.1 summarizes how the effects of emotion in speech tend to alter the pitch, timing,
voice quality, and articulation of the speech signal. Several of these features, however, are
also modulated by the prosodic effects that the speaker uses to communicate grammatical
structure and lexical correlates. These tend to have a more localized influence on the speech
signal, such as emphasizing a particular word. For recognition tasks, this increases the
challenge of isolating those feature characteristics modulated by emotion. Even humans are
not perfect at perceiving the intended emotion for those emotional states that have similar
acoustic characteristics. For instance, surprise can be perceived or understood as either
joyous surprise (i.e., happiness) or apprehensive surprise (i.e., fear). Disgust is a form of
disapproval and can be confused with anger.
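Why such confusions arise can be illustrated with a toy nearest-neighbor classifier. In the sketch below, the four-dimensional feature vectors (mean pitch, pitch range, speech rate, loudness) are invented solely to show that an emotion whose acoustic profile lies between two others is classified almost arbitrarily as one or the other; the numbers have no empirical basis.

    import math

    # Invented feature vectors: (mean pitch, pitch range, speech rate,
    # loudness), in arbitrary units relative to a neutral voice.
    profiles = {
        "joy":      (1.0, 1.0, 1.0, 1.0),
        "fear":     (1.12, 0.9, 1.1, 0.9),
        "surprise": (1.05, 0.95, 1.05, 0.95),  # lies between joy and fear
        "sadness":  (-1.0, -1.0, -1.0, -0.8),
    }

    def nearest(query, labeled):
        """Classify a feature vector by its nearest labeled profile."""
        return min(labeled, key=lambda name: math.dist(query, labeled[name]))

    labeled = {k: v for k, v in profiles.items() if k != "surprise"}
    print(nearest(profiles["surprise"], labeled))  # "joy", narrowly over "fear"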
There have been a few systems developed to synthesize emotional speech. The Affect
Editor by Janet Cahn is among the earliest work in this area (Cahn, 1990). Her system was based
on DECtalk3, a commercially available text-to-speech synthesizer. Given an English
sentence and an emotional quality (one of anger, disgust, fear, joy, sorrow, or surprise), she
developed a methodology for mapping the emotional correlates of speech (changes in pitch,
timing, voice quality, and articulation) onto the underlying DECtalk synthesizer settings.
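A minimal sketch of this methodology, assuming hypothetical parameter names and offset values (Cahn's actual mapping targeted DECtalk3's own settings and scales), might look as follows:

    # Sketch of an Affect Editor-style mapping: an emotional quality is
    # realized as offsets applied to a neutral voice. Parameter names and
    # numbers are hypothetical stand-ins for the real DECtalk3 settings.
    NEUTRAL = {"average_pitch_hz": 120, "pitch_range_pct": 100,
               "speech_rate_wpm": 180, "breathiness": 0, "precision": 0}

    EMOTION_OFFSETS = {
        "anger":  {"average_pitch_hz": +10, "pitch_range_pct": +60,
                   "speech_rate_wpm": +30, "precision": +2},
        "sorrow": {"average_pitch_hz": -20, "pitch_range_pct": -40,
                   "speech_rate_wpm": -40, "breathiness": +2, "precision": -2},
    }

    def synthesizer_settings(emotion: str) -> dict:
        """Apply an emotion's offsets to the neutral voice settings."""
        settings = dict(NEUTRAL)
        for param, delta in EMOTION_OFFSETS.get(emotion, {}).items():
            settings[param] += delta
        return settings

    print(synthesizer_settings("sorrow")["speech_rate_wpm"])  # -> 140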

