Page 313 - Concise Encyclopedia of Robotics

Speech Synthesis
                            You can probably tell when a person speaking to you is angry or happy,
                            regardless of whether that person is from Texas, Indiana, Idaho, or
                            Maine. However, some accents sound more authoritative than others;
                            some sound funny if you have not been exposed to them before. Along
                            with accent, word choice varies from region to region; this is
                            dialect. For robotics engineers, producing a speech synthesizer with a
                            credible “tone of voice” is a challenge.
                            Record and playback
                            The most primitive form of speech synthesizer is a set of tape recordings
                            of individual words. You have heard these in automatic telephone answering
                            machines and services. Most cities have a telephone number you can call
                            to get local time; some of these use word recordings. They all have a
                            characteristic choppy, interrupted sound.
                              There are several drawbacks to these systems. Perhaps the biggest
                            problem is that each word requires a separate recording, on a separate
                            length of tape. These tapes must be mechanically accessed, and this takes
                            time. A large speech vocabulary is therefore impractical with this method.
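The record-and-playback idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any real telephone time service: the function names, the word list, and the use of plain lists of numbers as stand-ins for recorded clips are all hypothetical. It shows why every spoken word needs its own recording, and why the gaps between clips give the output a choppy sound.

```python
# Hypothetical sketch of word-by-word playback for a talking clock.
# Each vocabulary word is assumed to have exactly one prerecorded clip.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty"}


def number_words(n):
    """Return the recorded words needed to say a number from 0 to 59."""
    if n < 20:
        return [ONES[n]]
    tens, ones = divmod(n, 10)
    words = [TENS[tens * 10]]
    if ones:
        words.append(ONES[ones])
    return words


def time_announcement(hour, minute):
    """Return the sequence of word recordings for a 12-hour clock time."""
    words = ["the", "time", "is"] + number_words(hour)
    if minute == 0:
        words.append("o'clock")
    elif minute < 10:
        words += ["oh"] + number_words(minute)
    else:
        words += number_words(minute)
    return words


def play(words, clips, gap):
    """Join one clip per word, separated by `gap` samples of silence.

    `clips` maps each word to its recording (here, a list of samples).
    A word with no recording raises KeyError: it simply cannot be spoken.
    """
    output = []
    for index, word in enumerate(words):
        if index > 0:
            output.extend([0] * gap)  # silence between words: the choppy sound
        output.extend(clips[word])
    return output
```

For example, `time_announcement(3, 15)` yields the five recordings "the", "time", "is", "three", "fifteen". Any word outside the recorded set cannot be produced at all, which is the vocabulary limit described above.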
                            Reading text
                            Printed text can be read by a machine using optical character recognition
                            (OCR), and converted into a standard digital code called ASCII
                            (pronounced “ASK-ee”). The ASCII text is then translated into speech
                            sounds, and a digital-to-analog converter (DAC) turns the resulting
                            digital signal into an audible voice. In this way, a machine can read text
                            out loud. Although they are rather expensive at the time of this writing,
                            these machines are being used to help blind people read printed text.
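The first stages of such a reading machine can be sketched as follows. This is a minimal illustration, assuming the OCR step has already produced a text string. The tiny `LEXICON` table, its sound labels, and the function names are hypothetical and are not taken from any real synthesizer. The final stages, generating a waveform from the sound units and passing it through a DAC, are omitted.

```python
# Hypothetical sketch: text -> ASCII codes -> speech-sound labels.

def to_ascii_codes(text):
    """Return the ASCII code (0 to 127) of each character in the text."""
    return list(text.encode("ascii"))


# Illustrative word-to-sound table; a real system would need far more
# entries plus rules for words that are not listed.
LEXICON = {
    "the": ["DH", "AH"],
    "robot": ["R", "OW", "B", "AA", "T"],
    "reads": ["R", "IY", "D", "Z"],
}


def to_sound_units(text):
    """Map each word to sound labels, spelling out unknown words by letter."""
    units = []
    for word in text.lower().split():
        word = word.strip(".,;:!?\"'")
        if not word:
            continue
        units.extend(LEXICON.get(word, list(word.upper())))
    return units
```

Calling `to_ascii_codes("Robot")` gives `[82, 111, 98, 111, 116]`, one code per character. The codes identify letters, not sounds, so a lookup or rule step such as `to_sound_units` is needed before any audio can be generated.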
                              Because there are only 128 allophones in the English language, a
                            machine can be designed to read almost any text. However, machines lack
                            a sense of which inflections are best for the different scenes that come up
                            in a story. With technical or scientific text, this is rarely a problem, but in
                            reading a story to a child, mental imagery is crucial. It is like an imaginary
                            movie, and it is helped along by the emotions of the reader. No machine
                            yet devised can paint pictures, or elicit moods, in a listener’s mind as well
                            as a human being can. A human reader picks up these cues from context. The
                            tone of a sentence might depend on what happened in the previous sentence,
                            paragraph, or chapter. Technology is a long way from giving a machine
                            the ability to understand, and appreciate, a good story, but nothing short
                            of that level of AI will produce a vivid “story movie” in a listener’s mind.

                            The process
                            There are several ways in which a machine can be programmed to pro-
                            duce speech. A simplified block diagram of one process is shown in the




                                                   