Page 227 - Designing Sociable Robots
breazeal-79017 book March 18, 2002 14:16
conveyance (Fleming & Dobbs, 1999). Kismet’s ten lip postures tend toward the absolute
minimal set specified by Fleming and Dobbs (1999), but this is reasonable given the robot’s
physical
appearance. As the robot speaks, new lip posture targets are specified at 33 Hz. Since the
phonemes do not change this quickly, many of the phonemes repeat. There is an inherent limit
in how fast Kismet’s lip and jaw motors can move to the next commanded posture, so the
challenge of co-articulation is somewhat addressed by the physics of the motors and mechanism.
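The smoothing effect of a speed-limited motor can be illustrated with a minimal sketch. Only the 33 Hz target rate comes from the text; the function name `track_targets`, the normalized position units, and the `max_speed` value are hypothetical and chosen for illustration, not taken from Kismet's motor controller.

```python
FRAME_HZ = 33.0  # target update rate stated in the text


def track_targets(targets, max_speed, start=0.0, frame_hz=FRAME_HZ):
    """Follow commanded positions with a speed-limited motor.

    targets   -- one commanded position per frame (normalized units, assumed)
    max_speed -- largest position change per second (hypothetical value)
    Returns the position reached at the end of each frame.
    """
    max_step = max_speed / frame_hz  # largest move possible in one frame
    position = start
    trajectory = []
    for target in targets:
        error = target - position
        # Clamp the move so the motor never exceeds its speed limit.
        step = max(-max_step, min(max_step, error))
        position += step
        trajectory.append(position)
    return trajectory


# A posture held for three frames, then released for three frames.
targets = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
print(track_targets(targets, max_speed=16.5))
# prints [0.5, 1.0, 1.0, 0.5, 0.0, 0.0]
```

With these assumed numbers the motor needs two frames to reach each new posture, so abrupt target changes become ramps that pass through intermediate lip shapes.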
Lip synchronization is only part of the equation, however. Faces are not completely still
when speaking, but move in synchrony to provide emphasis along with the speech. Using
the energy of the speech signal to animate Kismet’s face (along with the lips and jaw) greatly
enhances the impression that Kismet “means” what it says. For Kismet, the energy of the
speech signal influences the movement of its eyelids and ears. Larger speech amplitudes
result in a proportional widening of the eyes and downward pulse of the ears. This adds a
nice degree of facial emphasis to accompany the stress of the vocalization.
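As a rough illustration of this mapping, the sketch below converts the energy of one window of speech samples into proportional eyelid and ear offsets. The use of RMS as the energy measure, the gains, the `full_scale` normalization, and all names are assumptions made for the example; the text states only that larger amplitudes widen the eyes and pulse the ears downward proportionally.

```python
import math


def rms_energy(samples):
    """Root-mean-square amplitude of one window of audio samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def facial_emphasis(samples, eye_gain=1.0, ear_gain=1.0, full_scale=1.0):
    """Map speech energy to proportional eyelid and ear offsets.

    The gains and full_scale are hypothetical tuning parameters.
    Positive eyelid offset = wider eyes; negative ear offset = ears lowered.
    """
    level = min(rms_energy(samples) / full_scale, 1.0)  # clamp to [0, 1]
    return {
        "eyelid_open_offset": eye_gain * level,
        "ear_offset": -ear_gain * level,
    }


print(facial_emphasis([0.5, -0.5, 0.5, -0.5]))
# prints {'eyelid_open_offset': 0.5, 'ear_offset': -0.5}
```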
Since the speech signal influences facial animation, the emotional correlates of facial
posture must be blended with the animation arising from speech. How this is accomplished
within the face control motor system is described at length in chapter 10. The emotional
expression establishes the baseline facial posture about which all facial animation moves.
The current “emotional” state also influences the speed with which the facial actuators move
(lower arousal results in slower movements, higher arousal results in quicker movements).
In addition, emotions that correspond to higher arousal produce more energetic speech,
resulting in bigger amplitude swings about the expression baseline. Similarly, emotions
that correspond to lower arousal produce less energetic speech, which results in smaller
amplitudes. The end product is a highly expressive and coordinated movement of face
with voice. For instance, angry-sounding speech is accompanied by large, quick, twitchy
movements of the ears and eyelids. This undeniably conveys agitation and irritation. In contrast,
sad-sounding speech is accompanied by slow, droopy, listless movements of the ears and
eyelids. This conveys a forlorn quality that often evokes sympathy from the human observer.
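The relationship between arousal, baseline posture, and speech-driven movement can be sketched as follows. This is not the blending scheme of chapter 10; it is a simplified stand-in that assumes arousal is a number in [0, 1] and that it scales both speech-driven amplitude and actuator speed linearly. The scaling range (0.5x to 1.5x) and all names are hypothetical.

```python
def arousal_scale(arousal):
    """Hypothetical linear scale factor: 0.5x at arousal 0, 1.5x at arousal 1."""
    arousal = max(0.0, min(1.0, arousal))
    return 0.5 + arousal


def blend_posture(baseline, speech_offset, arousal):
    """Add speech-driven movement on top of the emotional baseline posture.

    Higher arousal stands in for more energetic speech, so it produces
    larger swings about the baseline.
    """
    return baseline + arousal_scale(arousal) * speech_offset


def actuator_speed(nominal_speed, arousal):
    """Lower arousal gives slower movements; higher arousal gives quicker ones."""
    return nominal_speed * arousal_scale(arousal)


# Same speech-driven offset, two emotional states (values are illustrative).
print(blend_posture(baseline=0.25, speech_offset=0.5, arousal=1.0))  # prints 1.0
print(blend_posture(baseline=0.25, speech_offset=0.5, arousal=0.0))  # prints 0.5
print(actuator_speed(nominal_speed=2.0, arousal=0.0))                # prints 1.0
```

Under these assumptions, a high-arousal state such as anger yields swings three times larger and movements three times faster than a low-arousal state such as sadness, which matches the qualitative contrast described above.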
11.6 Limitations and Extensions
Kismet’s expressive speech can certainly be improved. In the current implementation I
have only included those acoustic correlates that have a global influence on the speech
signal and do not require local analysis of the sentence structure. I currently modulate voice
quality, speech rate, pitch range, average pitch, intensity, and the global pitch contour. Data
from naive subjects is promising, although more could certainly be done. I have done very
little with changes in articulation. The precision or imprecision of articulation could be
enhanced by substituting voiced for unvoiced phonemes as Cahn describes in her thesis.

