[Figure 7.2: block diagram. Robot-directed speech enters the speech processing system, which produces pitch, periodicity, and energy; a filter and pre-processing stage passes pitch and energy to the feature extractor (F1 … Fn); the classifier then outputs approval, attentional bid, prohibition, soothing, or neutral.]

Figure 7.2
The spoken affective intent recognizer.
7.4  The Affective Intent Classifier

As shown in figure 7.2, the affective speech recognizer receives robot-directed speech as input. The speech signal is analyzed by the low-level speech processing system, producing time-stamped pitch (Hz), percent periodicity (a measure of how likely it is that a frame is a voiced segment), energy (dB), and phoneme values in real time.² The next module performs filtering and pre-processing to reduce the amount of noise in the data. The pitch value of a frame is simply set to 0 if the corresponding percent periodicity indicates that the frame is more likely to correspond to unvoiced speech. The resulting pitch and energy data are then passed through the feature extractor, which calculates a set of selected features (F1 to Fn). Finally, based on the trained model, the classifier determines whether the computed features are derived from an approval, an attentional bid, a prohibition, soothing speech, or a neutral utterance.
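To make the feature-extraction and classification stages concrete, here is a minimal Python sketch. The particular prosodic statistics and the Gaussian naive Bayes classifier are illustrative assumptions; the text does not specify the exact feature set or model at this point.

import numpy as np
from sklearn.naive_bayes import GaussianNB

CLASSES = ["approval", "attentional bid", "prohibition", "soothing", "neutral"]

def extract_features(pitch_hz, energy_db):
    """Compute a global feature vector (F1 ... Fn) for one utterance.

    Assumes unvoiced frames were already zeroed by the pre-processing
    stage. These statistics are plausible prosodic features, not
    necessarily the ones Kismet used.
    """
    voiced = pitch_hz[pitch_hz > 0]           # voiced frames only
    return np.array([
        voiced.mean(),                        # mean pitch
        voiced.std(),                         # pitch variability
        voiced.max() - voiced.min(),          # pitch range
        energy_db.mean(),                     # mean energy
        energy_db.max() - energy_db.min(),    # energy range
    ])

# Given training data (X: stacked feature vectors, y: class indices),
# a new utterance is classified from its features:
#   clf = GaussianNB().fit(X, y)
#   intent = CLASSES[clf.predict(features[None, :])[0]]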
Two female adults who frequently interact with Kismet as caregivers were recorded. The speakers were asked to express all five affective intents (approval, attentional bid, prohibition, comfort, and neutral) during the interaction. Recordings were made using a wireless microphone, and the output signal was sent to the low-level speech processing system running on Linux. For each utterance, this phase produced a 16-bit, single-channel, 8 kHz signal (in .wav format) as well as its corresponding real-time pitch, percent periodicity, energy, and phoneme values. All recordings were performed in Kismet's usual environment to minimize variability of environment-specific noise. Samples containing extremely loud noises (door slams, etc.) were eliminated, and the remaining data set was labeled according to the speakers' affective intents during the interaction. There were a total of 726 utterances in the final data set (approximately 145 utterances per class).
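The recording format can be checked programmatically; below is a small sketch using Python's standard wave module (the filename is hypothetical).

import wave
import numpy as np

# Verify one utterance matches the 16-bit, single-channel, 8 kHz format
# and load its samples as a numpy array.
with wave.open("utterance.wav", "rb") as w:
    assert w.getnchannels() == 1        # single channel
    assert w.getsampwidth() == 2        # 16-bit samples
    assert w.getframerate() == 8000     # 8 kHz
    samples = np.frombuffer(w.readframes(w.getnframes()), dtype="<i2")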
The pitch value of a frame was set to 0 if the corresponding percent periodicity was lower than a threshold value, indicating that the frame was more likely to correspond to unvoiced speech.
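A minimal sketch of this thresholding step, assuming frame-aligned pitch and periodicity arrays; the actual threshold value is not given in the text, so the one below is an arbitrary placeholder.

import numpy as np

# Assumed cutoff; the book does not report the actual value.
PERIODICITY_THRESHOLD = 0.5

def zero_unvoiced_pitch(pitch_hz, periodicity):
    # Frames whose percent periodicity falls below the threshold are
    # treated as unvoiced, and their pitch values are set to 0.
    pitch = np.asarray(pitch_hz, dtype=float).copy()
    unvoiced = np.asarray(periodicity) < PERIODICITY_THRESHOLD
    pitch[unvoiced] = 0.0
    return pitch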


2. This auditory processing code is provided by the Spoken Language Systems Group at MIT. For now, the phoneme information is not used in the recognizer.