
System Structure.    The ER system is part of a new-generation computerized call center that integrates databases, decision support systems, and different media such as voice messages, e-mail messages, and a WWW server into one information space. The system consists of three processes: a wave file monitor, a voice mail center, and a message prioritizer. The wave file monitor periodically reads the contents of the voice message directory, compares it to the list of processed messages, and, if a new message is detected, processes the message and creates a summary file and an emotion description file. The summary file contains the following information: five numbers describing the distribution of emotions, the length of the message, and the percentage of silence in it. The emotion description file stores data describing the emotional content of each 1–3 second chunk of the message. The prioritizer is a process that reads the summary files of processed messages, sorts the messages according to their emotional content, length, and some other criteria, and suggests an assignment of agents to return the calls. Finally, it generates a web page that lists all current assignments. The voice mail center is an additional tool that helps operators and supervisors visualize the emotional content of voice messages.
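As an illustration only, the sketch below shows how a wave file monitor of this kind could be organized: it polls the message directory, skips already-processed messages, and writes a summary file and an emotion description file for each new one. The directory names, JSON formats, and the analyze_message helper are assumptions for the sketch, not details of the original system.

import json
import time
from pathlib import Path

VOICE_DIR = Path("voice_messages")      # incoming .wav messages (assumed location)
SUMMARY_DIR = Path("summaries")         # per-message output files (assumed location)
PROCESSED_LIST = Path("processed.txt")  # names of messages already handled

def analyze_message(wav_path):
    """Placeholder for the emotion recognizer: should return the distribution
    over the five emotions, the message length, the percentage of silence,
    and per-chunk (1-3 s) emotion labels."""
    raise NotImplementedError

def monitor_once():
    SUMMARY_DIR.mkdir(exist_ok=True)
    processed = set(PROCESSED_LIST.read_text().split()) if PROCESSED_LIST.exists() else set()
    for wav in VOICE_DIR.glob("*.wav"):
        if wav.name in processed:
            continue                                  # message already processed
        emotions, length_s, silence_pct, chunks = analyze_message(wav)
        # Summary file: emotion distribution, message length, percentage of silence.
        summary = {"emotions": emotions, "length_s": length_s, "silence_pct": silence_pct}
        (SUMMARY_DIR / f"{wav.stem}.summary.json").write_text(json.dumps(summary))
        # Emotion description file: emotional content of each 1-3 second chunk.
        (SUMMARY_DIR / f"{wav.stem}.emotion.json").write_text(json.dumps(chunks))
        processed.add(wav.name)
    PROCESSED_LIST.write_text("\n".join(sorted(processed)))

if __name__ == "__main__":
    while True:          # poll the voice message directory periodically
        monitor_once()
        time.sleep(30)

A prioritizer built on these summary files could then sort messages by, for instance, their anger score and length before suggesting agent assignments, mirroring the description above; the exact sorting criteria of the original system are not specified here.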

5.     Conclusion
  We have explored how well people and computers recognize emotions in speech. Several conclusions can be drawn from the above results. First, decoding emotions in speech is a complex process that is influenced by the cultural, social, and intellectual characteristics of subjects. People are not perfect at decoding even such manifest emotions as anger and happiness. Second, anger is the most recognizable and the easiest emotion to portray. It is also the most important emotion for business. But anger has numerous variants (for example, hot anger, cold anger, etc.) that can introduce variability into acoustic features and dramatically influence the accuracy of recognition. Third, pattern recognition techniques based on neural networks proved to be useful for emotion recognition in speech and for creating customer relationship management systems.

                              Notes
                                1. Each utterance was recorded using a close-talk microphone. The first 100 utterances were recorded
                              at 22-kHz/8 bit and the remaining 600 utterances at 22-kHz/16 bit.
                                2. The rows and the columns represent true and evaluated categories, respectively. For example, the
                              second row says that 11.9% of utterances that were portrayed as happy were evaluated as normal (unemo-
                              tional), 61.4% as true happy, 10.1% as angry, 4.1% as sad, and 12.5% as afraid.
  3. The speaking rate was calculated as the inverse of the average length of the voiced part of the utterance. For all other parameters we calculated the following statistics: mean, standard deviation, minimum, maximum, and range. Additionally, for F0 the slope was calculated as a linear regression over the voiced part of speech, i.e., the line that fits the pitch contour. We also calculated the relative voiced energy. Altogether we estimated 43 features for each utterance.
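A minimal sketch of how such per-utterance statistics might be computed, assuming frame-level F0 and energy contours and a voiced-frame mask have already been extracted; it only illustrates the statistics named in this note, not the full 43-feature set.

import numpy as np

def contour_stats(x):
    """Mean, standard deviation, minimum, maximum, and range of a contour."""
    x = np.asarray(x, dtype=float)
    return {"mean": x.mean(), "std": x.std(), "min": x.min(),
            "max": x.max(), "range": x.max() - x.min()}

def utterance_features(f0, energy, voiced, frame_s=0.01):
    """Illustrative features from frame-level F0 and energy contours.
    f0, energy : per-frame pitch and energy values
    voiced     : boolean mask marking voiced frames
    """
    f0, energy, voiced = np.asarray(f0), np.asarray(energy), np.asarray(voiced, bool)
    feats = {}
    feats.update({f"F0_{k}": v for k, v in contour_stats(f0[voiced]).items()})
    feats.update({f"energy_{k}": v for k, v in contour_stats(energy).items()})
    # F0 slope: linear regression (line fitting the pitch contour) over voiced frames.
    t = np.flatnonzero(voiced) * frame_s
    feats["F0_slope"] = np.polyfit(t, f0[voiced], 1)[0]
    # Relative voiced energy: share of total energy that falls in voiced frames.
    feats["rel_voiced_energy"] = energy[voiced].sum() / energy.sum()
    # Speaking rate, read here as the inverse of the mean voiced-segment length.
    runs = np.diff(np.flatnonzero(np.diff(np.r_[0, voiced.astype(int), 0])))[::2]
    feats["speaking_rate"] = 1.0 / (runs.mean() * frame_s)
    return feats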
  4. We ran RELIEF-F for the s70 data set, varying the number of nearest neighbors from 1 to 12, and ordered the features according to their sum of ranks.
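One possible reading of this procedure, sketched under the assumption that the third-party skrebate package provides a ReliefF class with an n_neighbors parameter and a feature_importances_ attribute after fitting; this is not the implementation used in the original work.

import numpy as np
from scipy.stats import rankdata
from skrebate import ReliefF  # assumed RELIEF-F implementation

def rank_sum_feature_order(X, y, k_range=range(1, 13)):
    """Order features by their sum of ranks across RELIEF-F runs
    with the number of nearest neighbors varied over k_range."""
    rank_sums = np.zeros(X.shape[1])
    for k in k_range:
        scores = ReliefF(n_neighbors=k).fit(X, y).feature_importances_
        # Rank 1 = most important feature for this k (higher score is better).
        rank_sums += rankdata(-scores, method="average")
    return np.argsort(rank_sums)  # lowest rank sum (best) first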