Figure 7.2
The spoken affective intent recognizer. [Diagram: robot-directed speech passes through a
low-level speech processing system (producing pitch, periodicity, and energy), a filter and
pre-processing stage, and a feature extractor (F1 to Fn); a classifier then labels the
utterance as approval, attentional bid, prohibition, soothing, or neutral.]
7.4 The Affective Intent Classifier
As shown in figure 7.2, the affective speech recognizer receives robot-directed speech as
input. The speech signal is analyzed by the low-level speech processing system, produc-
ing time-stamped pitch (Hz), percent periodicity (a measure of how likely it is that a frame
contains voiced speech), energy (dB), and phoneme values in real time.[2] The next module per-
forms filtering and pre-processing to reduce the amount of noise in the data. The pitch
value of a frame is simply set to 0 if the corresponding percent periodicity indicates that the
frame is more likely to correspond to unvoiced speech. The resulting pitch and energy data
are then passed through the feature extractor, which calculates a set of selected features
(F1 to Fn). Finally, based on the trained model, the classifier determines whether the
computed features are derived from an approval, an attentional bid, a prohibition, soothing
speech, or a neutral utterance.
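To make these stages concrete, here is a minimal sketch in Python of the filtering and
feature-extraction steps. The periodicity threshold of 0.5 and the particular statistics
computed (pitch mean, variance, range, and mean energy) are illustrative assumptions; this
passage does not specify the threshold or which features F1 to Fn are actually used.

```python
import numpy as np

def filter_pitch(pitch, periodicity, threshold=0.5):
    """Zero out pitch values for frames that are likely unvoiced.

    As described in the text, a frame whose percent periodicity indicates
    unvoiced speech has its pitch set to 0. The threshold here is an
    assumed placeholder value.
    """
    pitch = np.asarray(pitch, dtype=float)
    periodicity = np.asarray(periodicity, dtype=float)
    return np.where(periodicity < threshold, 0.0, pitch)

def extract_features(pitch, energy):
    """Compute a feature vector (F1 to Fn) from filtered pitch and energy.

    These statistics are stand-ins; the actual feature set is not
    enumerated in this passage.
    """
    pitch = np.asarray(pitch, dtype=float)
    voiced = pitch[pitch > 0]           # pitch statistics over voiced frames only
    if voiced.size == 0:
        voiced = np.zeros(1)            # guard against fully unvoiced utterances
    return np.array([
        voiced.mean(),                  # F1: mean pitch (Hz)
        voiced.var(),                   # F2: pitch variance
        voiced.max() - voiced.min(),    # F3: pitch range
        float(np.mean(energy)),         # F4: mean energy (dB)
    ])

# A trained classifier would then map each feature vector to one of the
# five classes, e.g.:
#   features = extract_features(filter_pitch(pitch, periodicity), energy)
#   label = model.predict([features])   # approval, attentional bid,
#                                       # prohibition, soothing, or neutral
```

Zeroing likely-unvoiced frames keeps spurious pitch estimates out of the statistics computed
downstream, which is the noise reduction the pre-processing stage is meant to provide.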
Two female adults who frequently interact with Kismet as caregivers were recorded. The
speakers were asked to express all five affective intents (approval, attentional bid, prohibi-
tion, soothing, and neutral) during the interaction. Recordings were made using a wireless
microphone, and the output signal was sent to the low-level speech processing system run-
ning on Linux. For each utterance, this phase produced a 16-bit, single-channel, 8 kHz signal
(in a .wav format) as well as its corresponding real-time pitch, percent periodicity, energy,
and phoneme values. All recordings were performed in Kismet’s usual environment to min-
imize variability of environment-specific noise. Samples containing extremely loud noises
(door slams, etc.) were eliminated, and the remaining data set was labeled according to
the speakers’ affective intents during the interaction. The final data set contained a total
of 726 utterances, approximately 145 utterances per class.
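As a rough illustration of how such a corpus could be organized and checked, the sketch
below loads 16-bit, single-channel, 8 kHz .wav files grouped into per-class directories and
reports the class counts. The directory layout, file naming, and label strings are
hypothetical; only the audio format and the rough class balance come from the text.

```python
import wave
from collections import Counter
from pathlib import Path

LABELS = ["approval", "attention", "prohibition", "soothing", "neutral"]

def load_corpus(root="kismet_corpus"):
    """Collect (samples, label) pairs from a hypothetical <root>/<label>/*.wav layout."""
    data = []
    for label in LABELS:
        for path in sorted(Path(root, label).glob("*.wav")):
            with wave.open(str(path), "rb") as w:
                # Sanity-check the recording format described in the text:
                # 16-bit (2-byte) samples, one channel, 8 kHz.
                assert w.getsampwidth() == 2
                assert w.getnchannels() == 1
                assert w.getframerate() == 8000
                data.append((w.readframes(w.getnframes()), label))
    return data

corpus = load_corpus()
print(Counter(label for _, label in corpus))  # expect roughly 145 utterances per class
```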
The pitch value of a frame was set to 0 if the corresponding percent periodicity was
lower than a threshold value, indicating that the frame was more likely to correspond
to unvoiced speech.
[2] This auditory processing code is provided by the Spoken Language Systems Group at MIT.
For now, the phoneme information is not used in the recognizer.

