Page 98 - Socially Intelligent Agents Creating Relationships with Computers and Robots
Emotion Recognition Agents for Speech Signal 81
3.4 Computer Recognition
To recognize emotions in speech we tried the following approaches:
K-nearest neighbors, neural networks, ensembles of neural network classifiers,
and a set of experts. In general, the approach based on ensembles of
neural network recognizers outperformed the others, and it was chosen for
implementation at the next stage. We summarize below the results obtained
with the different techniques.
K-nearest neighbors. We used 70% of the s70 data set as a database of
cases for comparison and 30% as a test set. We ran the algorithm for K = 1
to 15 and for feature sets of 8, 10, and 14 features. The best average
recognition accuracy (∼55%) was reached using 8 features, but the average
accuracy for anger is much higher (∼65%) for the 10- and 14-feature sets.
All recognizers performed very poorly for fear (about 5–10%).
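
The K-nearest-neighbors procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the majority vote among neighbors, the random seed, and the helper names (split_70_30, knn_predict, accuracy, sweep_k) are all assumptions made for the example. Each sample is taken to be a (feature_vector, emotion_label) pair.

```python
import math
import random
from collections import Counter


def split_70_30(samples, seed=0):
    """Shuffle the samples and split them into a 70% case base and a 30% test set."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def knn_predict(case_base, features, k):
    """Return the most common label among the k nearest cases.

    Assumption: Euclidean distance and a simple majority vote.
    """
    nearest = sorted(case_base, key=lambda case: math.dist(case[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(case_base, test_set, k):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(case_base, x, k) == y for x, y in test_set)
    return hits / len(test_set)


def sweep_k(case_base, test_set, k_values=range(1, 16)):
    """Evaluate accuracy for each K (the text uses K = 1 to 15)."""
    return {k: accuracy(case_base, test_set, k) for k in k_values}
```

Running sweep_k once per feature set (8, 10, and 14 features) would reproduce the shape of the experiment; the accuracies quoted in the text come from the original data and are not derived from this sketch.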
Neural networks. We used a two-layer backpropagation neural network
architecture with an 8-, 10- or 14-element input vector, 10 or 20 nodes in the
hidden sigmoid layer and five nodes in the output linear layer. To train and
test our algorithms we used the data sets s70, s80 and s90, randomly split into
training (70% of utterances) and test (30%) subsets. We created several neural
network classifiers trained with different initial weight matrices. This approach
applied to the s70 data set and the 8-feature set gave an average accuracy of
about 65% with the following distribution for emotion categories: normal state
is 55–65%, happiness is 60–70%, anger is 60–80%, sadness is 60–70%, and
fear is 25–50%.
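
To make the architecture concrete, the forward pass of such a network (sigmoid hidden layer, linear output layer with five nodes) might look like the sketch below. Backpropagation training is omitted. The weight initialization range, the ordering of the emotion labels, the rule of picking the output node with the largest activation, and all function names are assumptions for illustration only.

```python
import math
import random

# Assumed label order for the five output nodes (not specified in the text).
EMOTIONS = ["normal", "happiness", "anger", "sadness", "fear"]


def init_network(n_inputs, n_hidden, n_outputs=5, seed=0):
    """Create random initial weights; a different seed gives a different classifier."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "w_hidden": layer(n_hidden, n_inputs),
        "b_hidden": [0.0] * n_hidden,
        "w_out": layer(n_outputs, n_hidden),
        "b_out": [0.0] * n_outputs,
    }


def forward(net, features):
    """Sigmoid hidden layer followed by a linear output layer."""
    hidden = [
        1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(row, features)) + b)))
        for row, b in zip(net["w_hidden"], net["b_hidden"])
    ]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(net["w_out"], net["b_out"])
    ]


def classify(net, features):
    """Assumed decision rule: the emotion whose output node is largest."""
    outputs = forward(net, features)
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return EMOTIONS[best]
```

For example, init_network(8, 10) corresponds to the 8-feature input with 10 hidden nodes, and init_network(14, 20) to the 14-feature input with 20 hidden nodes; varying seed mimics training several classifiers from different initial weight matrices.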
Ensembles of neural network classifiers. We used ensemble sizes from
7 to 15 classifiers. Results for ensembles of 15 neural networks, the s70 data
set, all three sets of features, and both neural network architectures (10 and 20
neurons in the hidden layer) were the following. The accuracy for happiness
remained the same (∼65%) for the different sets of features and architectures.
The accuracy for fear was relatively low (35–53%). The accuracy for anger
started at 73% for the 8-feature set and increased to 81% for the 14-feature set.
The accuracy for sadness varied from 73% to 83% and achieved its maximum
for the 10-feature set. The average total accuracy was about 70%.
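
One common way to combine the members of such an ensemble is majority voting, sketched below together with a per-emotion accuracy measure of the kind reported above. The passage does not state how the individual decisions were combined, so the voting rule, the tie-breaking behavior, and the function names are assumptions. Each classifier is modeled as any callable that maps a feature vector to an emotion label.

```python
from collections import Counter


def ensemble_predict(classifiers, features):
    """Return the label chosen by most classifiers (assumed majority vote).

    Ties go to the label that was voted for first, which is how
    Counter.most_common orders equal counts.
    """
    votes = Counter(clf(features) for clf in classifiers)
    return votes.most_common(1)[0][0]


def per_emotion_accuracy(classifiers, test_set):
    """Accuracy of the ensemble computed separately for each true emotion label."""
    correct, total = Counter(), Counter()
    for features, label in test_set:
        total[label] += 1
        if ensemble_predict(classifiers, features) == label:
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}
```

An odd ensemble size, such as the 7 to 15 classifiers mentioned in the text, reduces the chance of ties, although ties remain possible with five emotion categories.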
Set of experts. This approach is based on the following idea. Instead of
training a neural network to recognize all emotions, we can train a set of
specialists or experts, each of which recognizes only one emotion, and then
combine their
results to classify a given sample. The average accuracy of emotion recogni-
tion for this approach was about 70% except for fear, which was ∼44% for the
10-neuron, and ∼56% for the 20-neuron architecture. The accuracy of non-