Page 107 - Designing Sociable Robots


breazeal-79017 book March 18, 2002 14:54

88 Chapter 7

to unvoiced speech. Even after this procedure, observation of the resulting pitch contours
still indicated the presence of substantial noise. Specifically, a significant number of
errors were discovered in the high pitch value region (above 500 Hz). Therefore, additional
preprocessing was performed on all pitch data. For each pitch contour, a histogram of ten
regions was constructed. Using the heuristic that the pitch contour was relatively smooth,
it was determined that if only a few pitch values were located in the high region while the
rest were much lower (and none resided in between), then the high values were likely to
be noise. Note that this process did not eliminate high but smooth pitch contours, since
their pitch values would be distributed evenly across nearby regions.
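
The histogram heuristic above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the book: the function name, the `max_outliers` cutoff for "only a few" values, and the `min_gap` number of empty regions standing in for "much lower" are all assumptions, since the text gives only the ten-region histogram and the qualitative rule.

```python
def remove_high_pitch_outliers(pitch, n_bins=10, max_outliers=3, min_gap=2):
    """Zero out isolated high pitch values in one contour.

    pitch: list of pitch values in Hz, with 0 marking unvoiced frames.
    n_bins: number of histogram regions (ten in the text).
    max_outliers, min_gap: hypothetical thresholds, not from the text.
    """
    voiced = [p for p in pitch if p > 0]
    if len(voiced) < 2:
        return list(pitch)
    lo, hi = min(voiced), max(voiced)
    if hi == lo:
        return list(pitch)
    width = (hi - lo) / n_bins

    def bin_of(p):
        # The maximum value falls exactly on the upper edge, so clamp it.
        return min(int((p - lo) / width), n_bins - 1)

    counts = [0] * n_bins
    for p in voiced:
        counts[bin_of(p)] += 1

    # Walk down from the top region, collecting the topmost occupied group.
    i = n_bins - 1
    n_high = 0
    while i >= 0 and counts[i] > 0:
        n_high += counts[i]
        i -= 1

    # Count the empty regions directly below that group.
    gap = 0
    j = i
    while j >= 0 and counts[j] == 0:
        gap += 1
        j -= 1

    # No gap, too small a gap, or too many high values: keep the contour.
    if i < 0 or gap < min_gap or n_high > max_outliers:
        return list(pitch)

    # Treat the isolated high group as noise and mark it unvoiced.
    return [0 if p > 0 and bin_of(p) > i else p for p in pitch]
```

With these assumed thresholds, a single 600 Hz value in a contour that otherwise sits near 200 Hz is removed, while a contour that rises steadily from 300 Hz to 600 Hz fills every region and is left unchanged.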

Classification Method

In all training phases, each class of data was modeled using a Gaussian mixture model,
updated with the EM algorithm and a kurtosis-based approach for dynamically deciding
the appropriate number of kernels (Vlassis & Likas, 1999). Because the set of training
data was limited, cross-validation was performed in all classification processes.
Specifically, a subset of the data was set aside as a test set, and a classifier was
trained on the remaining data. The classifier’s performance
was then tested on the held-out test set. This process was repeated 100 times per classifier.
The mean and variance of the percentage of correctly classified test data were calculated to
estimate the classifier’s performance.
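
The evaluation loop described above might look like the following sketch. Several things here are assumptions: the text does not state the size of the held-out subset (`test_fraction` is hypothetical), and the nearest-class-mean classifier is only a stand-in so the example runs on its own; it is not the Gaussian mixture model used in the book.

```python
import random
import statistics


def repeated_holdout(data, train_fn, predict_fn,
                     n_rounds=100, test_fraction=0.2, seed=0):
    """Return (mean, variance) of percent-correct over repeated random splits.

    data: list of (feature, label) pairs.
    train_fn, predict_fn: caller-supplied classifier hooks (hypothetical names).
    test_fraction: assumed held-out share; the text does not give one.
    """
    rng = random.Random(seed)
    n_test = max(1, int(len(data) * test_fraction))
    accuracies = []
    for _ in range(n_rounds):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        model = train_fn(train)
        correct = sum(1 for x, y in test if predict_fn(model, x) == y)
        accuracies.append(100.0 * correct / len(test))
    return statistics.mean(accuracies), statistics.pvariance(accuracies)


# Stand-in classifier (NOT a Gaussian mixture model): nearest class mean in 1-D.
def train_nearest_mean(train):
    by_class = {}
    for x, y in train:
        by_class.setdefault(y, []).append(x)
    return {y: statistics.mean(xs) for y, xs in by_class.items()}


def predict_nearest_mean(model, x):
    return min(model, key=lambda y: abs(model[y] - x))


# Toy, made-up data: two well-separated classes.
toy_data = ([(float(i), "low") for i in range(10)]
            + [(100.0 + i, "high") for i in range(10)])
mean_acc, var_acc = repeated_holdout(toy_data, train_nearest_mean,
                                     predict_nearest_mean)
```

Passing the classifier in as two functions keeps the splitting and scoring logic independent of the model, so a mixture-model trainer could be substituted without touching the loop.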

As shown in figure 7.3, the preprocessed pitch contours in the labeled data resemble
Fernald’s prototypical prosodic contours for approval, attention, prohibition, and
comfort/soothing. A set of global pitch and energy related features (see table 7.1) was
used to recognize these proposed patterns. All pitch features were measured using only
non-zero pitch values. Using this feature set, a sequential forward feature selection
process was applied to construct an optimal classifier. Each possible feature pair’s
classification performance was measured, and the pairs were sorted from highest to lowest
performance. Successively, a feature pair from the sorted list was added to the selected
feature set to determine the best n features for an optimal classifier. Table 7.2 shows
the results of the classifiers constructed using the best eight feature pairs.
Classification performance increases as more features are added, reaches a maximum
(78.77 percent) with five features in the set, and levels off above 60 percent with six or
more features. It was found that global pitch and energy measures were useful in roughly
separating the proposed patterns based on arousal (largely distinguished by energy
measures) and valence (largely distinguished by pitch measures). However, further
processing was required to separate all five classes from one another.
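
The pair-ranking selection procedure described above can be sketched as follows. This is an illustrative reading of the text rather than the original implementation: `score_fn` is a hypothetical callback that returns the classification performance for a given feature subset (for example, a cross-validated percent-correct), and the feature names and weights in the toy example are made up.

```python
from itertools import combinations


def select_features_by_ranked_pairs(features, score_fn, n_pairs=8):
    """Grow a feature set by adding feature pairs in order of pair performance.

    features: iterable of feature names.
    score_fn: hypothetical callback mapping a tuple of features to a score.
    n_pairs: how many of the top-ranked pairs to add (eight in the text).
    Returns (best_set, best_score, history).
    """
    # Score every possible pair and sort from highest to lowest.
    ranked = sorted(combinations(features, 2), key=score_fn, reverse=True)

    selected = []
    history = []
    for pair in ranked[:n_pairs]:
        # Add whichever members of the pair are not yet selected.
        for f in pair:
            if f not in selected:
                selected.append(f)
        history.append((tuple(selected), score_fn(tuple(selected))))

    # Keep the feature set that scored best along the way.
    best_set, best_score = max(history, key=lambda item: item[1])
    return best_set, best_score, history


# Toy scoring function with made-up weights; "f4" hurts performance.
toy_weights = {"f1": 3, "f2": 2, "f3": 1, "f4": -2}


def toy_score(subset):
    return sum(toy_weights[f] for f in subset)


best_set, best_score, history = select_features_by_ranked_pairs(
    ["f1", "f2", "f3", "f4"], toy_score, n_pairs=4)
```

In the toy run, the score rises while helpful features are added and drops once the harmful one enters, which mirrors the rise-then-fall pattern the text reports for table 7.2.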

Accordingly, the classifier consists of several mini-classifiers executing in stages. In
the beginning stages, the classifier uses global pitch and energy features to separate some
of the classes into pairs (in this case, clusters of soothing along with low-energy neutral,
prohibition along with high-energy neutral, and attention along with approval were formed).
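
The staged structure can be pictured as a simple dispatch: a first stage picks one of the three clusters, and a cluster-specific second stage picks a class within it. The sketch below shows only that control flow. The cluster keys, the feature names, the thresholds, and the `stage2_score` value are all invented for illustration, and this passage does not say how the later stages separate each pair.

```python
# The three first-stage clusters named in the text, keyed by hypothetical ids.
CLUSTERS = {
    "soothing_or_low_neutral": ("soothing", "low-energy neutral"),
    "prohibition_or_high_neutral": ("prohibition", "high-energy neutral"),
    "attention_or_approval": ("attention", "approval"),
}


def classify_in_stages(features, first_stage, second_stages):
    """Run a first-stage cluster decision, then a per-cluster refinement."""
    cluster = first_stage(features)
    if cluster not in CLUSTERS:
        raise ValueError(f"unknown cluster: {cluster!r}")
    label = second_stages[cluster](features)
    if label not in CLUSTERS[cluster]:
        raise ValueError(f"{label!r} is not a member of {cluster!r}")
    return label


# Toy first stage with made-up thresholds on made-up feature names.
def toy_first_stage(f):
    if f["energy"] < 0.3:
        return "soothing_or_low_neutral"
    if f["pitch_mean"] < 200.0:
        return "prohibition_or_high_neutral"
    return "attention_or_approval"


# Placeholder second stages: each splits its pair on a hypothetical score.
def make_toy_second_stage(pair):
    return lambda f: pair[0] if f["stage2_score"] >= 0 else pair[1]


toy_second_stages = {name: make_toy_second_stage(pair)
                     for name, pair in CLUSTERS.items()}

example = classify_in_stages(
    {"energy": 0.8, "pitch_mean": 150.0, "stage2_score": 1.0},
    toy_first_stage, toy_second_stages)
```

Keeping each stage as a separate callable means every mini-classifier can use its own feature subset, which is the point of staging the decision.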
