Guidelines from Animation
The earliest examples of lip synchronization for animated characters date back to the 1940s in classical animation (Blair, 1949), and to the 1970s for computer-animated characters (Parke, 1972). In these early works, all of the lip animation was crafted by hand (a very time-consuming process). Over time, a set of guidelines evolved that are largely adhered to by animation artists today (Madsen, 1969).
According to Madsen, simplicity is the secret to successful lip animation. Extreme accuracy in cartoon animation often looks forced or unnatural. Thus, the goal in animation is not always to imitate realistic lip motions, but to create a visual shorthand that passes unchallenged by the viewer (Madsen, 1969). As the realism of the character increases, however, the accuracy of the lip synchronization must increase with it.
Kismet is a fanciful and cartoon-like character, so the guidelines for cartoon animation apply. In this case, the guidelines suggest that the animator focus on vowel lip motions (especially o and w) accented with consonant postures (m, b, p) for lip closing. Precision of these consonants gives credibility to the generalized patterns of vowels. The transitions between vowels and consonants should be reasonable approximations of lip and jaw movement. Fortunately, more latitude is granted for more fanciful characters. The mechanical response time of Kismet's lip and jaw motors places strict constraints on how fast the lips and jaw can transition from posture to posture. Madsen also stresses that care must be taken in conveying emotion, as the expression of the voice and face can change dramatically.
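
The vowel-plus-consonant guideline can be made concrete as a small lookup from phonemes to target lip postures. The sketch below (in Python) is purely illustrative; the posture values, phoneme labels, and representation are assumptions for exposition, not Kismet's actual control code.

    # Illustrative phoneme-to-lip-posture lookup following Madsen's guideline:
    # articulate the key vowels (especially o and w), close the lips crisply
    # on m, b, p, and let everything else fall back on a generic vowel shape.

    # Posture values are assumed: 0.0 = closed/unrounded, 1.0 = open/rounded.
    LIP_POSTURES = {
        "closed":  {"jaw": 0.0, "rounding": 0.0},   # m, b, p
        "rounded": {"jaw": 0.3, "rounding": 1.0},   # o, w
        "open":    {"jaw": 0.8, "rounding": 0.2},   # generalized vowel
    }

    PHONEME_TO_POSTURE = {
        "m": "closed", "b": "closed", "p": "closed",
        "o": "rounded", "w": "rounded",
        # all other phonemes map to the generalized vowel shorthand
    }

    def posture_for(phoneme: str) -> dict:
        """Return the target lip posture for a phoneme, defaulting to the
        generalized vowel shape (Madsen's 'visual shorthand')."""
        return LIP_POSTURES[PHONEME_TO_POSTURE.get(phoneme, "open")]
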
Extracting Lip Synch Info
To implement lip synchronization on Kismet, a variety of information must be computed in real-time from the speech signal. By placing DECtalk in memory mode and issuing the command string (the utterance with synthesizer settings), the DECtalk software generates the speech waveform and writes it to memory (an 11.025 kHz waveform). In addition, DECtalk extracts time-stamped phoneme information. From the speech waveform, the time-varying energy is computed over a window of 335 samples, taking care to synchronize the phoneme and energy information, and the resulting (phoneme[t], energy[t]) pairs are sent to the QNX machine at 33 Hz to coordinate jaw and lip motor control. A similar technique using DECtalk's phoneme extraction capability is reported by Waters and Levergood (1993) for real-time lip synchronization of computer-generated facial animation.
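
A window of 335 samples at 11.025 kHz spans roughly 30 ms, which is what yields the 33 Hz pair rate. The Python sketch below illustrates this energy extraction and pairing; the function names, the phoneme event format, and the use of NumPy are assumptions for illustration, not DECtalk's actual interface.

    import numpy as np

    SAMPLE_RATE = 11025   # DECtalk waveform rate (Hz)
    WINDOW = 335          # ~30.4 ms per window, i.e., ~33 frames per second

    def frame_energies(waveform: np.ndarray) -> np.ndarray:
        """Mean squared amplitude per non-overlapping 335-sample window."""
        n_frames = len(waveform) // WINDOW
        frames = waveform[: n_frames * WINDOW].reshape(n_frames, WINDOW)
        return (frames.astype(np.float64) ** 2).mean(axis=1)

    def pair_with_phonemes(energies, phoneme_events):
        """Align time-stamped phonemes (stamp in seconds) with energy frames,
        yielding the (phoneme[t], energy[t]) pairs streamed at 33 Hz."""
        frame_period = WINDOW / SAMPLE_RATE  # ~0.0304 s
        pairs = []
        for t, energy in enumerate(energies):
            frame_time = t * frame_period
            # take the most recent phoneme preceding this frame
            current = None
            for stamp, phoneme in phoneme_events:
                if stamp <= frame_time:
                    current = phoneme
            pairs.append((current, energy))
        return pairs

Using non-overlapping windows locks the frame rate to the window size, so the energy stream and the time-stamped phonemes can be aligned by frame index alone.
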
To control the jaw, the QNX machine receives the phoneme and energy information and
updates the commanded jaw position at 10 Hz. The mapping from energy to jaw opening is
linear, bounded within a range where the minimum position corresponds to a closed mouth,
and the maximum position corresponds to an open mouth characteristic of surprise. Using
only energy to control jaw position produces a lively effect but has its limitations (Parke &
Waters, 1996). For Kismet, the phoneme information is used to make sure that the jaw is
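
A minimal sketch of the linear energy-to-jaw mapping described above, assuming illustrative units for both energy and motor position (the actual bounds of Kismet's jaw motor and its energy scale are not given here):

    JAW_CLOSED = 0.0      # assumed minimum position: mouth closed
    JAW_SURPRISE = 1.0    # assumed maximum: wide open, as in surprise
    ENERGY_MAX = 0.05     # assumed energy value that saturates the mapping

    def jaw_position(energy: float) -> float:
        """Linear map from frame energy to jaw opening, clamped to the
        closed-to-surprise range; called at the 10 Hz jaw update rate."""
        fraction = min(max(energy / ENERGY_MAX, 0.0), 1.0)
        return JAW_CLOSED + fraction * (JAW_SURPRISE - JAW_CLOSED)
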

