Page 112 - Designing Sociable Robots

breazeal-79017  book  March 18, 2002  14:54





                       The Auditory System                                                   93





                       the pitch contour indicated whether the contour contained a down-sweep segment. It was
                       calculated by performing a linear fit on the contour segment starting at the maximum peak.
                       This classifier’s average performance is 80.3 percent.
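
As a rough illustration of the down-sweep feature described above, the sketch below fits a least-squares line to the part of the pitch contour that starts at its maximum peak and reports a down-sweep when the fitted slope is sufficiently negative. The function names, the per-frame pitch representation, and the `slope_threshold` value are assumptions made for this example; the text does not say how the linear fit is turned into a yes/no indication.

```python
def downsweep_slope(pitch):
    """Slope of a least-squares line fitted from the maximum peak onward.

    `pitch` is assumed to be a list of per-frame pitch values (e.g. in Hz).
    """
    if not pitch:
        raise ValueError("empty pitch contour")
    peak = max(range(len(pitch)), key=lambda i: pitch[i])  # first maximum
    tail = pitch[peak:]
    n = len(tail)
    if n < 2:
        return 0.0  # nothing follows the peak, so there is no sweep to measure
    mean_x = (n - 1) / 2
    mean_y = sum(tail) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(tail))
    return sxy / sxx


def has_downsweep(pitch, slope_threshold=-1.0):
    # slope_threshold (pitch units per frame) is a hypothetical value,
    # chosen only for illustration.
    return downsweep_slope(pitch) < slope_threshold


contour = [200, 220, 260, 240, 220, 200]
print(downsweep_slope(contour), has_downsweep(contour))
```

In practice the threshold would have to be tuned on training data.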

                       Stage 2B: Approval-attention versus prohibition versus high-intensity neutral  A
                       combination of pitch mean and energy variance works well in this stage. The resulting
                       classifier’s average performance is 90.0 percent. Based on Fernald’s prototypical prosodic
                       patterns, it was speculated that pitch variance would be a useful feature for distinguishing
                       between prohibition and the approval-attention cluster. Adding pitch variance into the
                       feature set increased the classifier’s average performance to 92.1 percent.
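
The three global features used in this stage could be computed along the following lines. This is only a sketch: the function name, the per-frame `pitch` and `energy` lists, and the use of population variance rather than sample variance are assumptions, and the classifier that consumes the features is not shown.

```python
from statistics import fmean, pvariance


def stage_2b_features(pitch, energy):
    """Return (pitch mean, energy variance, pitch variance) for one utterance.

    `pitch` and `energy` are assumed to be per-frame measurements;
    restricting pitch to voiced frames is left to the caller.
    """
    return (fmean(pitch), pvariance(energy), pvariance(pitch))


print(stage_2b_features([200.0, 220.0, 240.0], [1.0, 1.0, 1.0]))
```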

                       Stage 3: Approval versus attention  Since the approval class and attention class span
                       the same region in the global pitch versus energy feature space, prior knowledge (provided
                       by Fernald’s prototypical prosodic contours) gave the basis to introduce a new feature. As
                       mentioned above, approvals are characterized by an exaggerated rise-fall pitch contour.
                       This particular pitch pattern proved useful in distinguishing between the two classes. First,
                       a third-degree polynomial fit was performed on each pitch segment. Each segment’s slope
                       sequence was analyzed for a positive slope followed by a negative slope with magnitudes
                       higher than a threshold value. The length of the longest pitch segment that contributed to
                       the rise-fall pattern (0 if the pattern was absent) was recorded. This feature, together
                       with pitch variance, was used in the final classifier and generated an average performance
                       of 70.5 percent. Approval and attention are the most difficult to classify because both
                       classes exhibit high pitch and intensity. Although the shape of the pitch contour helped
                       to distinguish between the two classes, it is very difficult to achieve high classification
                       performance without looking at the linguistic content of the utterance.
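
The rise-fall feature can be sketched as follows, under one possible reading of the description: a cubic is fitted to each pitch segment, its derivative is sampled at every frame, and a segment counts as rise-fall if a slope above the threshold is later followed by a slope below the negative threshold. The helper names, the threshold value, and the choice to measure segment length in frames are all assumptions made for this example.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first)."""
    m = degree + 1
    # Augmented normal-equation matrix [A^T A | A^T y].
    rows = [[sum(x ** (i + j) for x in xs) for j in range(m)]
            + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(m)]
    for col in range(m):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(rows[r][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(m):
            if r != col:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
    return [rows[i][m] / rows[i][i] for i in range(m)]


def slope_sequence(segment):
    """Per-frame slopes of a cubic fitted to one pitch segment."""
    n = len(segment)
    ts = [i / (n - 1) for i in range(n)]  # normalized time, for conditioning
    _, c1, c2, c3 = polyfit(ts, segment, 3)
    return [(c1 + 2 * c2 * t + 3 * c3 * t * t) / (n - 1) for t in ts]


def has_rise_fall(slopes, threshold):
    rose = False
    for s in slopes:
        if s > threshold:
            rose = True
        elif rose and s < -threshold:
            return True
    return False


def rise_fall_feature(segments, threshold=2.0):
    """Length of the longest rise-fall segment, or 0 if there is none.

    threshold (pitch units per frame) is a hypothetical value.
    """
    lengths = [len(seg) for seg in segments
               if len(seg) >= 4  # a cubic fit needs at least four points
               and has_rise_fall(slope_sequence(seg), threshold)]
    return max(lengths, default=0)


rising = [200 + 10 * i for i in range(8)]
arch = [300 - 5 * (i - 5) ** 2 for i in range(11)]
print(rise_fall_feature([rising, arch]))
```

Solving the normal equations directly keeps the sketch free of dependencies; a numerical library's polynomial fitting routine would be the more robust choice for long segments.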

                       Overall Classification Performance
                       The final classifier was evaluated using a new test set generated by the same female speakers,
                       containing 371 utterances. Because each mini-classifier was trained using different portions
                       of the original database (for the single-stage classifier), a new data set was gathered to ensure
                       that no mini-classifier stage was tested on data used to train it. Table 7.4 shows the resulting
                       classification performance and compares it to an instance of the cross-validation results
                       of the best single-stage five-way classifier obtained using the five features described in
                       section 7.4. Both classifiers perform very well on prohibition utterances. The multi-stage
                       classifier performs significantly better in classifying the difficult classes, i.e., approval versus
                       attention and soothing versus neutral. This verifies that the features encoding the shape of the
                       pitch contours (derived from prior knowledge provided by Fernald’s prototypical prosodic
                       patterns) were very useful.