Page 107 - Designing Sociable Robots
88 Chapter 7
to unvoiced speech. Even after this procedure, the resulting pitch contours still showed
substantial noise. Specifically, a significant number of errors were found in the high
pitch region (above 500 Hz). Therefore, additional preprocessing was performed on all pitch
data. For each pitch contour, a histogram of ten regions was constructed. Under the
heuristic that a pitch contour is relatively smooth, if only a few pitch values fell in the
high region while the rest were much lower (with none in between), the high values were
judged likely to be noise. Note that this process did not eliminate high but smooth pitch
contours, since their pitch values would be distributed evenly across nearby regions.
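The histogram heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the function name, the `max_outliers` and `min_gap_bins` thresholds, and the choice to mark rejected values as 0 (unvoiced) are all assumptions, since the text gives only the ten-region histogram and the general rule.

```python
def remove_isolated_high_pitch(pitch, n_bins=10, max_outliers=3, min_gap_bins=2):
    """Return a copy of `pitch` with isolated high values set to 0 (unvoiced).

    `pitch` is a list of pitch values in Hz, where 0 marks an unvoiced frame.
    `max_outliers` and `min_gap_bins` are assumed thresholds, not source values.
    """
    voiced = [p for p in pitch if p > 0]
    if len(voiced) < 2 or max(voiced) == min(voiced):
        return list(pitch)

    lo, hi = min(voiced), max(voiced)
    width = (hi - lo) / n_bins

    def bin_of(p):
        # Clamp so that the maximum value falls in the last bin.
        return min(int((p - lo) / width), n_bins - 1)

    # Ten-region histogram of the voiced pitch values.
    counts = [0] * n_bins
    for p in voiced:
        counts[bin_of(p)] += 1

    # Scan from the top bin downward for the first run of empty bins that is
    # at least `min_gap_bins` wide ("none resided in between").
    empty_run = 0
    for b in range(n_bins - 1, -1, -1):
        if counts[b] == 0:
            empty_run += 1
            continue
        if empty_run >= min_gap_bins:
            first_high_bin = b + empty_run + 1
            n_high = sum(counts[first_high_bin:])
            if n_high <= max_outliers:
                # Only a few values sit above the gap: treat them as noise.
                return [0 if p > 0 and bin_of(p) >= first_high_bin else p
                        for p in pitch]
            return list(pitch)
        empty_run = 0
    return list(pitch)
```

A contour that is high but smooth fills neighboring bins without a wide gap, so the sketch leaves it unchanged. A single spike far above an otherwise low contour is zeroed.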
Classification Method
In all training phases, each class of data was modeled using a Gaussian mixture model,
updated with the EM algorithm and a kurtosis-based approach for dynamically deciding
the appropriate number of kernels (Vlassis & Likas, 1999). Because the set of training
data was limited, cross-validation was performed in all classification processes.
Specifically, a subset of the data was set aside as a test set, and a classifier was
trained on the remaining data. The classifier’s performance was then tested on the
held-out test set. This process was repeated 100 times per classifier. The mean and
variance of the percentage of correctly classified test data were calculated to
estimate the classifier’s performance.
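The evaluation loop described above (hold out a subset, train on the rest, test, repeat 100 times, report mean and variance) might look like the sketch below. Several parts are assumptions made for illustration: the 20 percent held-out fraction is not given in the text, and `train_nearest_mean` is a deliberately simple stand-in for the Gaussian mixture models, used only so the example is self-contained.

```python
import random
import statistics


def train_nearest_mean(samples):
    """Stand-in classifier (NOT the GMM from the text): one mean per class."""
    sums, counts = {}, {}
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    means = {y: [s / counts[y] for s in acc] for y, acc in sums.items()}

    def predict(x):
        # Pick the class whose mean is closest in squared Euclidean distance.
        return min(means,
                   key=lambda y: sum((a - b) ** 2 for a, b in zip(x, means[y])))

    return predict


def repeated_holdout(samples, train_fn, n_repeats=100, test_fraction=0.2, seed=0):
    """Return (mean, variance) of percent-correct over random hold-out splits.

    `samples` is a list of (feature_vector, label) pairs. `test_fraction`
    is an assumed value; the text does not state the held-out set size.
    """
    rng = random.Random(seed)
    n_test = max(1, int(len(samples) * test_fraction))
    accuracies = []
    for _ in range(n_repeats):
        shuffled = samples[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = train_fn(train)
        correct = sum(predict(x) == y for x, y in test)
        accuracies.append(100.0 * correct / len(test))
    return statistics.mean(accuracies), statistics.variance(accuracies)
```

Passing the training routine in as `train_fn` keeps the resampling logic independent of the model, so a mixture-model trainer could be substituted without changing the loop.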
As shown in figure 7.3, the preprocessed pitch contours in the labeled data resemble
Fernald’s prototypical prosodic contours for approval, attention, prohibition, and
comfort/soothing. A set of global pitch and energy related features (see table 7.1) was
used to recognize these proposed patterns. All pitch features were measured using only
non-zero pitch values. Using this feature set, a sequential forward feature selection
process was applied to construct an optimal classifier. The classification performance
of each possible feature pair was measured, and the pairs were sorted from highest to
lowest performance. Feature pairs from the sorted list were then added successively to
the selected feature set to determine the best n features for an optimal classifier.
Table 7.2 shows the results of the classifiers constructed using the best eight feature
pairs. Classification performance increases as more features are added, reaches a maximum
(78.77 percent) with five features in the set, and levels off above 60 percent with six or
more features. Global pitch and energy measures were found to be useful in roughly
separating the proposed patterns based on arousal (largely distinguished by energy
measures) and valence (largely distinguished by pitch measures). However, further
processing was required to separate all five classes from one another.
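The pair-ranking selection procedure described above can be sketched as follows. This is one plausible reading of the description rather than a confirmed implementation: `score_fn` is a hypothetical callable that returns the classification performance for a tuple of feature indices (in practice it would wrap the cross-validated classifier), and the function name and return values are assumptions.

```python
from itertools import combinations


def select_features_by_pairs(n_features, score_fn, n_pairs=8):
    """Rank all feature pairs, then grow the feature set pair by pair.

    `score_fn(features)` is a hypothetical scoring callable taking a tuple
    of feature indices and returning a performance figure.
    Returns (best_feature_set, best_score, history).
    """
    # Measure every possible feature pair and sort from highest to lowest.
    ranked_pairs = sorted(combinations(range(n_features), 2),
                          key=score_fn, reverse=True)

    selected = []
    history = []
    for pair in ranked_pairs[:n_pairs]:
        # Add the features of the next-best pair that are not yet selected.
        for f in pair:
            if f not in selected:
                selected.append(f)
        current = tuple(selected)
        history.append((current, score_fn(current)))

    # Keep the feature set that scored best along the way.
    best_set, best_score = max(history, key=lambda item: item[1])
    return best_set, best_score, history
```

Recording the score after each added pair gives the kind of curve the text reports, where performance rises, peaks at some set size, and then falls off as less useful features are added.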
Accordingly, the classifier consists of several mini-classifiers executing in stages. In
the beginning stages, the classifier uses global pitch and energy features to separate some
of the classes into pairs (in this case, clusters of soothing along with low-energy neutral,
prohibition along with high-energy neutral, and attention along with approval were formed).

