Page 112 - Designing Sociable Robots
The Auditory System 93
the pitch contour indicated whether the contour contained a down-sweep segment. It was
calculated by performing a linear fit on the contour segment starting at the maximum peak.
This classifier’s average performance is 80.3 percent.
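The down-sweep feature described above can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the pitch contour is a list of per-frame pitch values in Hz, and the function names and the `slope_threshold` cutoff are hypothetical.

```python
from statistics import mean


def linear_fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx


def has_down_sweep(contour, slope_threshold=-1.0):
    """Return True if the contour falls steeply after its maximum peak.

    contour: per-frame pitch values (Hz) of one contour (assumed format).
    slope_threshold: hypothetical cutoff in Hz per frame, not from the source.
    """
    if len(contour) < 3:
        return False
    # Index of the maximum peak (first occurrence if there are ties).
    peak = max(range(len(contour)), key=contour.__getitem__)
    tail = contour[peak:]
    if len(tail) < 3:
        return False  # too few frames after the peak for a meaningful fit
    # Linear fit on the part of the contour that starts at the peak.
    slope = linear_fit_slope(list(range(len(tail))), tail)
    return slope <= slope_threshold
```

For example, `has_down_sweep([200, 260, 300, 280, 250, 220, 190])` fits a line to the five frames from the peak onward and reports a down-sweep, whereas a steadily rising contour has no frames after its peak and yields `False`.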
Stage 2B: Approval-attention versus prohibition versus high-intensity neutral. A
combination of pitch mean and energy variance works well in this stage. The resulting
classifier’s average performance is 90.0 percent. Based on Fernald’s prototypical prosodic
patterns, it was speculated that pitch variance would be a useful feature for distinguishing
between prohibition and the approval-attention cluster. Adding pitch variance to the
feature set increased the classifier’s average performance to 92.1 percent.
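As a rough sketch, the three Stage 2B features could be computed from per-frame pitch and energy tracks as shown here. The function name, the dictionary keys, and the convention that unvoiced frames carry a pitch of 0 are assumptions made for illustration; the source does not specify how the tracks are represented.

```python
from statistics import mean, pvariance


def stage_2b_features(pitch, energy):
    """Compute pitch mean, pitch variance, and energy variance.

    pitch: per-frame pitch in Hz; 0 marks an unvoiced frame (assumption).
    energy: per-frame energy values (assumed format).
    """
    voiced = [p for p in pitch if p > 0]
    if not voiced:
        raise ValueError("utterance has no voiced frames")
    return {
        "pitch_mean": mean(voiced),
        "pitch_variance": pvariance(voiced),
        "energy_variance": pvariance(energy),
    }
```

A call such as `stage_2b_features([0, 200, 220, 240, 0], [1.0, 3.0, 5.0, 3.0, 1.0])` returns one feature vector per utterance, which would then be passed to whatever classifier is trained for this stage.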
Stage 3: Approval versus attention. Since the approval class and attention class span
the same region in the global pitch versus energy feature space, prior knowledge (provided
by Fernald’s prototypical prosodic contours) gave the basis to introduce a new feature. As
mentioned above, approvals are characterized by an exaggerated rise-fall pitch contour.
This particular pitch pattern proved useful in distinguishing between the two classes. First,
a third-degree polynomial was fit to each pitch segment. Each segment’s slope
sequence was analyzed for a positive slope followed by a negative slope, both with magnitudes
higher than a threshold value. The length of the longest pitch segment that contributed to the
rise-fall pattern (0 if the pattern was absent) was recorded. This feature, together
with pitch variance, was used in the final classifier and generated an average performance
of 70.5 percent. Approval and attention are the most difficult to classify because both
classes exhibit high pitch and intensity. Although the shape of the pitch contour helped
to distinguish between the two classes, it is very difficult to achieve high classification
performance without looking at the linguistic content of the utterance.
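The rise-fall feature described in this stage can be sketched in plain Python as follows. This is one possible reading of the procedure, not the author's code: it assumes each pitch segment is a list of per-frame pitch values, fits the cubic by ordinary least squares, and measures segment length in frames. All function names and the default `threshold` (Hz per frame) are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def cubic_fit(ts, ys):
    """Least-squares coefficients [c0, c1, c2, c3] of a cubic in t."""
    powers = [[t ** k for k in range(4)] for t in ts]
    ata = [[sum(p[i] * p[j] for p in powers) for j in range(4)]
           for i in range(4)]
    aty = [sum(p[i] * y for p, y in zip(powers, ys)) for i in range(4)]
    return solve_linear(ata, aty)


def has_rise_fall(segment, threshold):
    """True if the fitted slope rises above +threshold and later
    falls below -threshold (threshold in Hz per frame, hypothetical)."""
    n = len(segment)
    if n < 5:
        return False  # too short for a meaningful cubic fit
    # Normalize time to [0, 1] to keep the normal equations well conditioned.
    ts = [i / (n - 1) for i in range(n)]
    _, c1, c2, c3 = cubic_fit(ts, segment)
    # Derivative of the cubic, converted back to Hz per frame.
    slopes = [(c1 + 2 * c2 * t + 3 * c3 * t * t) / (n - 1) for t in ts]
    rose = False
    for s in slopes:
        if s > threshold:
            rose = True
        elif rose and s < -threshold:
            return True
    return False


def rise_fall_feature(segments, threshold=2.0):
    """Length (in frames) of the longest segment showing a rise-fall
    pattern, or 0 if no segment shows one."""
    lengths = [len(seg) for seg in segments if has_rise_fall(seg, threshold)]
    return max(lengths, default=0)
```

With a rising segment and a peaked one, for instance `rise_fall_feature([[200, 210, 220, 230, 240, 250], [200, 230, 260, 285, 300, 285, 260, 230, 200]])`, only the second segment qualifies, so the feature takes that segment's length. An utterance with no qualifying segment yields 0.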
Overall Classification Performance
The final classifier was evaluated using a new test set generated by the same female speakers,
containing 371 utterances. Because each mini-classifier was trained using different portions
of the original database (for the single-stage classifier), a new data set was gathered to ensure
that no mini-classifier stage was tested on data used to train it. Table 7.4 shows the resulting
classification performance and compares it to an instance of the cross-validation results
of the best single-stage five-way classifier obtained using the five features described in
section 7.4. Both classifiers perform very well on prohibition utterances. The multi-stage
classifier performs significantly better in classifying the difficult classes, i.e., approval versus
attention and soothing versus neutral. This verifies that the features encoding the shape of the
pitch contours (derived from prior knowledge provided by Fernald’s prototypical prosodic
patterns) were very useful.
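A per-class comparison of this kind can be tabulated with a few lines of code. The sketch below is illustrative only: the function name is hypothetical, and it assumes the true and predicted labels of the test utterances are available as two parallel lists.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of utterances of each true class that were labeled correctly."""
    totals = Counter(true_labels)
    correct = Counter(
        t for t, p in zip(true_labels, predicted_labels) if t == p
    )
    return {label: correct[label] / totals[label] for label in totals}
```

Running it once on the multi-stage classifier's predictions and once on the single-stage classifier's predictions gives two dictionaries that can be compared class by class, which makes differences on the harder pairs (approval versus attention, soothing versus neutral) easy to see.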

