System Structure. The ER system is part of a new generation computer-
ized call center that integrates databases, decision support systems, and differ-
ent media such as voice messages, e-mail messages and a WWW server into
one information space. The system consists of three processes: a wave file
monitor, a voice mail center and a message prioritizer. The wave file monitor
periodically reads the contents of the voice message directory, compares it to
the list of processed messages, and, if a new message is detected, processes
the message and creates a summary file and an emotion description file. The sum-
mary file contains the following information: five numbers that describe the
distribution of emotions, and the length and percentage of silence in the mes-
sage. The emotion description file stores data describing the emotional content
of each 1–3 second chunk of the message. The prioritizer is a process that reads
the summary files of processed messages, sorts them according to their emotional
content, length, and other criteria, and suggests an assignment of agents to
return the calls. Finally, it generates a web page that lists all current
assignments. The voice mail center is an additional tool that helps operators
and supervisors visualize the emotional content of voice messages.
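To make the three processes concrete, the sketch below shows how a polling monitor of this kind and a simple prioritizer might fit together in Python. The directory layout, file and field names, and the classify_chunks placeholder are all assumptions made for illustration; the chapter does not specify the implementation, so this is only a minimal sketch under those assumptions.

    import json
    import time
    from pathlib import Path

    VOICE_DIR = Path("voicemail")     # hypothetical directory of incoming .wav messages
    SUMMARY_DIR = Path("summaries")   # hypothetical directory for summary/description files

    EMOTIONS = ["normal", "happy", "angry", "sad", "afraid"]

    def classify_chunks(wav_path):
        """Placeholder for the neural-network classifier: returns one emotion
        label per 1-3 second chunk of the message."""
        raise NotImplementedError

    def process_message(wav_path):
        """Create a summary file and an emotion description file for one message."""
        labels = classify_chunks(wav_path)
        # Five numbers describing the distribution of emotions over the chunks.
        distribution = [labels.count(e) / len(labels) for e in EMOTIONS]
        summary = {
            "message": wav_path.name,
            "emotion_distribution": distribution,
            "length_seconds": None,    # would be read from the wave header
            "silence_percent": None,   # would come from a silence detector
        }
        (SUMMARY_DIR / (wav_path.stem + ".summary.json")).write_text(json.dumps(summary))
        (SUMMARY_DIR / (wav_path.stem + ".emotions.json")).write_text(json.dumps(labels))

    def wave_file_monitor(poll_interval=30):
        """Periodically scan the voice message directory and process new messages."""
        processed = set()
        while True:
            for wav_path in VOICE_DIR.glob("*.wav"):
                if wav_path.name not in processed:
                    process_message(wav_path)
                    processed.add(wav_path.name)
            time.sleep(poll_interval)

    def prioritize():
        """Read the summary files and sort messages, most urgent first.
        Here urgency is simply the proportion of angry chunks; the real
        prioritizer also weighs message length and other criteria."""
        summaries = [json.loads(p.read_text()) for p in SUMMARY_DIR.glob("*.summary.json")]
        angry = EMOTIONS.index("angry")
        summaries.sort(key=lambda s: s["emotion_distribution"][angry], reverse=True)
        return summaries

The prioritizer's output would then feed the generated web page that lists the suggested agent assignments.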
5. Conclusion
We have explored how well people and computers recognize emotions in
speech. Several conclusions can be drawn from the above results. First, de-
coding emotions in speech is a complex process that is influenced by cultural,
social, and intellectual characteristics of the subjects. People are not perfect at
decoding even such manifest emotions as anger and happiness. Second, anger
is the most recognizable and the easiest emotion to portray. It is also the most
important emotion for business. But anger has numerous variants (for example,
hot anger and cold anger) that can introduce variability into acoustic features and
dramatically influence the accuracy of recognition. Third, pattern recognition
techniques based on neural networks proved to be useful for emotion recogni-
tion in speech and for creating customer relationship management systems.
Notes
1. Each utterance was recorded using a close-talk microphone. The first 100 utterances were recorded
at 22-kHz/8 bit and the remaining 600 utterances at 22-kHz/16 bit.
2. The rows and the columns represent true and evaluated categories, respectively. For example, the
second row says that 11.9% of utterances that were portrayed as happy were evaluated as normal (unemo-
tional), 61.4% as true happy, 10.1% as angry, 4.1% as sad, and 12.5% as afraid.
3. The speaking rate was calculated as the inverse of the average length of the voiced parts of the utterance.
For all other parameters we calculated the following statistics: mean, standard deviation, minimum, max-
imum, and range. Additionally, for F0 the slope was calculated as a linear regression over the voiced parts of
speech, i.e. the line that best fits the pitch contour. We also calculated the relative voiced energy. Altogether
we estimated 43 features for each utterance.
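To make the feature set concrete, a rough sketch of these statistics is given below. The contour representation (frame-level arrays with a voicing flag) and the function names are assumptions for illustration, not the authors' implementation.

    import numpy as np

    def contour_statistics(values):
        """Mean, standard deviation, minimum, maximum, and range of a
        parameter contour (e.g. F0, energy, or a formant track)."""
        values = np.asarray(values, dtype=float)
        return {
            "mean": values.mean(),
            "std": values.std(),
            "min": values.min(),
            "max": values.max(),
            "range": values.max() - values.min(),
        }

    def f0_slope(times, f0_values):
        """Slope of the pitch contour: a least-squares line fitted to F0
        over the voiced frames."""
        slope, _intercept = np.polyfit(times, f0_values, deg=1)
        return slope

    def speaking_rate(voiced_segment_lengths):
        """Inverse of the average length (in seconds) of the voiced segments."""
        return 1.0 / np.mean(voiced_segment_lengths)

    def relative_voiced_energy(frame_energy, voiced_flags):
        """Share of the total energy that falls in voiced frames."""
        frame_energy = np.asarray(frame_energy, dtype=float)
        voiced = np.asarray(voiced_flags, dtype=bool)
        return frame_energy[voiced].sum() / frame_energy.sum()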
4. We ran RELIEF-F for the s70 data set varying the number of nearest neighbors from 1 to 12, and
ordered the features according to their sum of ranks.
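The sum-of-ranks ordering can be illustrated as follows. The RELIEF-F scoring itself is left as an external step (the score matrix stands for whatever per-feature importance scores each run produces), so only the aggregation over the twelve settings is shown, under that assumption.

    import numpy as np

    def rank_features_by_sum_of_ranks(score_matrix):
        """score_matrix[k, j]: RELIEF-F importance of feature j in the run with
        k+1 nearest neighbors (12 runs x 43 features in this setup).
        Returns feature indices ordered from most to least relevant."""
        scores = np.asarray(score_matrix, dtype=float)
        # Within each run, rank the features so the most important one gets rank 1:
        # argsort of argsort yields ranks; negate so larger score => smaller rank.
        ranks = np.argsort(np.argsort(-scores, axis=1), axis=1) + 1
        rank_sums = ranks.sum(axis=0)
        return np.argsort(rank_sums)    # smallest rank sum first

Applied to a 12 x 43 matrix of importance scores, this returns the 43 feature indices ordered by overall relevance across the twelve neighbor settings.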