Table 11.1
Typical effect of emotions on adult human speech, adapted from Murray and Arnott (1993). The table has been
extended to include some acoustic correlates of the emotion of surprise.

                Fear              Anger             Sorrow            Joy               Disgust           Surprise
Speech Rate     Much Faster       Slightly Faster   Slightly Slower   Faster or Slower  Very Much Slower  Much Faster
Pitch Average   Very Much Higher  Very Much Higher  Slightly Lower    Much Higher       Very Much Lower   Much Higher
Pitch Range     Much Wider        Much Wider        Slightly Narrower Much Wider        Slightly Wider
Intensity       Normal            Higher            Lower             Higher            Lower             Higher
Voice Quality   Irregular Voicing Breathy           Resonant          Breathy Blaring   Grumbled
                                  Chest Tone                                            Chest Tone
Pitch Changes   Normal            Abrupt on         Downward          Smooth Upward     Wide Downward     Rising Contour
                                  Stressed          Inflections       Inflections       Terminal
                                  Syllable                                              Inflections
Articulation    Precise           Tense             Slurring          Normal            Normal
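
Where correlates like these are used in software, the table can be transcribed directly into a lookup
structure. The sketch below simply restates Table 11.1 as data; the dictionary layout and key names are
choices made for this illustration and are not taken from Kismet's implementation.

# Table 11.1 transcribed as a lookup structure (qualitative labels only).
# The key names are choices made for this sketch, not canonical terms.
EMOTION_SPEECH_CORRELATES = {
    "fear": {
        "speech_rate": "much faster", "pitch_average": "very much higher",
        "pitch_range": "much wider", "intensity": "normal",
        "voice_quality": "irregular voicing", "pitch_changes": "normal",
        "articulation": "precise",
    },
    "anger": {
        "speech_rate": "slightly faster", "pitch_average": "very much higher",
        "pitch_range": "much wider", "intensity": "higher",
        "voice_quality": "breathy chest tone",
        "pitch_changes": "abrupt on stressed syllable", "articulation": "tense",
    },
    "sorrow": {
        "speech_rate": "slightly slower", "pitch_average": "slightly lower",
        "pitch_range": "slightly narrower", "intensity": "lower",
        "voice_quality": "resonant", "pitch_changes": "downward inflections",
        "articulation": "slurring",
    },
    "joy": {
        "speech_rate": "faster or slower", "pitch_average": "much higher",
        "pitch_range": "much wider", "intensity": "higher",
        "voice_quality": "breathy blaring",
        "pitch_changes": "smooth upward inflections", "articulation": "normal",
    },
    "disgust": {
        "speech_rate": "very much slower", "pitch_average": "very much lower",
        "pitch_range": "slightly wider", "intensity": "lower",
        "voice_quality": "grumbled chest tone",
        "pitch_changes": "wide downward terminal inflections",
        "articulation": "normal",
    },
    # Cells left blank in Table 11.1 are omitted here.
    "surprise": {
        "speech_rate": "much faster", "pitch_average": "much higher",
        "intensity": "higher", "pitch_changes": "rising contour",
    },
}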



                       She took great care to introduce the global prosodic effects of emotion while still preserving
                       the more local influences of grammatical and lexical correlates of speech intonation. In a
                       different approach, Jun Sato (see www.ee.seikei.ac.jp/user/junsato/research/)
                       trained a neural network to modulate a neutrally spoken speech signal (in Japanese) to
                       convey one of four emotional states (happiness, anger, sorrow, disgust). The neural network
                       was trained on speech spoken by Japanese actors. This approach has the advantage that
                       the output speech signal sounds more natural than purely synthesized speech. It has the
                       disadvantage, however, that the speech input to the system must be prerecorded.
                         With respect to giving Kismet the ability to generate emotive vocalizations, Cahn’s work is
                       a valuable resource. The DECtalk software gives us the flexibility to have Kismet generate
                       its own utterance by assembling strings of phonemes (with pitch accents). I use Cahn’s
                       technique for mapping the emotional correlates of speech (as defined by her vocal affect
                       parameters) to the underlying synthesizer settings. Because Kismet’s vocalizations are at
                       the proto-dialogue level, there is no grammatical structure. As a result, it is sufficient to
                       produce only the purely global emotional influence on the speech signal.
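
As a rough illustration of this kind of mapping, the sketch below converts a qualitative emotion label into
concrete synthesizer settings and assembles a short proto-dialogue utterance from phonemes with pitch
accents. The baseline values, offsets, function names, and the apostrophe accent convention are all
assumptions made for this sketch; it does not reproduce Kismet's actual parameter values or DECtalk's
command syntax.

# A minimal sketch, not Kismet's implementation: an emotion label selects
# relative prosodic offsets (in the spirit of Table 11.1), which are applied
# to neutral baseline settings; an utterance is then assembled from phonemes,
# with pitch accents marked. All names and numbers here are illustrative.

NEUTRAL = {"rate_wpm": 180, "avg_pitch_hz": 120, "pitch_range_pct": 100, "volume": 60}

# Relative adjustments, loosely following the qualitative trends in Table 11.1.
EMOTION_OFFSETS = {
    "anger":  {"rate_wpm": +15, "avg_pitch_hz": +40, "pitch_range_pct": +50, "volume": +10},
    "sorrow": {"rate_wpm": -20, "avg_pitch_hz": -10, "pitch_range_pct": -30, "volume": -10},
    "joy":    {"rate_wpm": +20, "avg_pitch_hz": +30, "pitch_range_pct": +50, "volume": +10},
}

def synthesizer_settings(emotion):
    """Apply an emotion's offsets to the neutral baseline settings."""
    offsets = EMOTION_OFFSETS.get(emotion, {})
    return {key: NEUTRAL[key] + offsets.get(key, 0) for key in NEUTRAL}

def assemble_utterance(phonemes, accented):
    """Join phonemes into a babble-like utterance, marking accented ones.

    The leading apostrophe is a placeholder accent marker for this sketch,
    not the synthesizer's own notation.
    """
    return " ".join(("'" + p) if i in accented else p
                    for i, p in enumerate(phonemes))

if __name__ == "__main__":
    print(synthesizer_settings("joy"))
    print(assemble_utterance(["d", "aa", "b", "iy", "d", "uw"], accented={1, 4}))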

                       11.2 Expressive Voice Synthesis

                       Cahn’s vocal affect parameters (VAP) alter the pitch, timing, voice quality, and articulation
                       aspects of the speech signal. She documented how these parameter settings can be set to