
System Structure.    The ER system is part of a new-generation computerized call center that integrates databases, decision support systems, and different media such as voice messages, e-mail messages, and a WWW server into one information space. The system consists of three processes: a wave file monitor, a voice mail center, and a message prioritizer. The wave file monitor periodically reads the contents of the voice message directory, compares it to the list of processed messages, and, if a new message is detected, processes the message and creates a summary file and an emotion description file. The summary file contains the following information: five numbers describing the distribution of emotions, the length of the message, and the percentage of silence in it. The emotion description file stores data describing the emotional content of each 1–3 second chunk of the message. The prioritizer is a process that reads the summary files of processed messages, sorts the messages according to their emotional content, length, and some other criteria, and suggests an assignment of agents to return the calls. Finally, it generates a web page that lists all current assignments. The voice mail center is an additional tool that helps operators and supervisors visualize the emotional content of voice messages.
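As an illustration only, the sketch below shows how a wave file monitor of this kind could be organized: it polls the message directory, skips already-processed messages, and writes a summary file and an emotion description file for each new one. The directory names, JSON formats, and the analyze_message helper are assumptions for the sketch, not details of the original system.

import json
import time
from pathlib import Path

VOICE_DIR = Path("voice_messages")      # incoming .wav messages (assumed location)
SUMMARY_DIR = Path("summaries")         # per-message output files (assumed location)
PROCESSED_LIST = Path("processed.txt")  # names of messages already handled

def analyze_message(wav_path):
    """Placeholder for the emotion recognizer: should return the distribution
    over the five emotions, the message length, the percentage of silence,
    and per-chunk (1-3 s) emotion labels."""
    raise NotImplementedError

def monitor_once():
    SUMMARY_DIR.mkdir(exist_ok=True)
    processed = set(PROCESSED_LIST.read_text().split()) if PROCESSED_LIST.exists() else set()
    for wav in VOICE_DIR.glob("*.wav"):
        if wav.name in processed:
            continue                                  # message already processed
        emotions, length_s, silence_pct, chunks = analyze_message(wav)
        # Summary file: emotion distribution, message length, percentage of silence.
        summary = {"emotions": emotions, "length_s": length_s, "silence_pct": silence_pct}
        (SUMMARY_DIR / f"{wav.stem}.summary.json").write_text(json.dumps(summary))
        # Emotion description file: emotional content of each 1-3 second chunk.
        (SUMMARY_DIR / f"{wav.stem}.emotion.json").write_text(json.dumps(chunks))
        processed.add(wav.name)
    PROCESSED_LIST.write_text("\n".join(sorted(processed)))

if __name__ == "__main__":
    while True:          # poll the voice message directory periodically
        monitor_once()
        time.sleep(30)

A prioritizer built on these summary files could then sort messages by, for instance, their anger score and length before suggesting agent assignments, mirroring the description above; the exact sorting criteria of the original system are not specified here.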

5.     Conclusion
  We have explored how well people and computers recognize emotions in speech. Several conclusions can be drawn from the above results. First, decoding emotions in speech is a complex process that is influenced by the cultural, social, and intellectual characteristics of subjects. People are not perfect at decoding even such manifest emotions as anger and happiness. Second, anger is the most recognizable and the easiest emotion to portray. It is also the most important emotion for business. But anger has numerous variants (for example, hot anger, cold anger, etc.) that can introduce variability into acoustic features and dramatically influence the accuracy of recognition. Third, pattern recognition techniques based on neural networks proved to be useful for emotion recognition in speech and for creating customer relationship management systems.

                              Notes
                                1. Each utterance was recorded using a close-talk microphone. The first 100 utterances were recorded
                              at 22-kHz/8 bit and the remaining 600 utterances at 22-kHz/16 bit.
                                2. The rows and the columns represent true and evaluated categories, respectively. For example, the
                              second row says that 11.9% of utterances that were portrayed as happy were evaluated as normal (unemo-
                              tional), 61.4% as true happy, 10.1% as angry, 4.1% as sad, and 12.5% as afraid.
  3. The speaking rate was calculated as the inverse of the average length of the voiced part of the utterance. For all other parameters we calculated the following statistics: mean, standard deviation, minimum, maximum, and range. Additionally, for F0 the slope was calculated as a linear regression over the voiced part of speech, i.e., the line that fits the pitch contour. We also calculated the relative voiced energy. Altogether we estimated 43 features for each utterance.
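A minimal sketch of how such per-utterance statistics might be computed, assuming frame-level F0 and energy contours and a voiced-frame mask have already been extracted; it only illustrates the statistics named in this note, not the full 43-feature set.

import numpy as np

def contour_stats(x):
    """Mean, standard deviation, minimum, maximum, and range of a contour."""
    x = np.asarray(x, dtype=float)
    return {"mean": x.mean(), "std": x.std(), "min": x.min(),
            "max": x.max(), "range": x.max() - x.min()}

def utterance_features(f0, energy, voiced, frame_s=0.01):
    """Illustrative features from frame-level F0 and energy contours.
    f0, energy : per-frame pitch and energy values
    voiced     : boolean mask marking voiced frames
    """
    f0, energy, voiced = np.asarray(f0), np.asarray(energy), np.asarray(voiced, bool)
    feats = {}
    feats.update({f"F0_{k}": v for k, v in contour_stats(f0[voiced]).items()})
    feats.update({f"energy_{k}": v for k, v in contour_stats(energy).items()})
    # F0 slope: linear regression (line fitting the pitch contour) over voiced frames.
    t = np.flatnonzero(voiced) * frame_s
    feats["F0_slope"] = np.polyfit(t, f0[voiced], 1)[0]
    # Relative voiced energy: share of total energy that falls in voiced frames.
    feats["rel_voiced_energy"] = energy[voiced].sum() / energy.sum()
    # Speaking rate, read here as the inverse of the mean voiced-segment length.
    runs = np.diff(np.flatnonzero(np.diff(np.r_[0, voiced.astype(int), 0])))[::2]
    feats["speaking_rate"] = 1.0 / (runs.mean() * frame_s)
    return feats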
  4. We ran RELIEF-F for the s70 data set, varying the number of nearest neighbors from 1 to 12, and ordered the features according to their sum of ranks.
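One possible reading of this procedure, sketched under the assumption that the third-party skrebate package provides a ReliefF class with an n_neighbors parameter and a feature_importances_ attribute after fitting; this is not the implementation used in the original work.

import numpy as np
from scipy.stats import rankdata
from skrebate import ReliefF  # assumed RELIEF-F implementation

def rank_sum_feature_order(X, y, k_range=range(1, 13)):
    """Order features by their sum of ranks across RELIEF-F runs
    with the number of nearest neighbors varied over k_range."""
    rank_sums = np.zeros(X.shape[1])
    for k in k_range:
        scores = ReliefF(n_neighbors=k).fit(X, y).feature_importances_
        # Rank 1 = most important feature for this k (higher score is better).
        rank_sums += rankdata(-scores, method="average")
    return np.argsort(rank_sums)  # lowest rank sum (best) first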