Page 313 - Concise Encyclopedia of Robotics
Speech Synthesis
You can probably tell when a person speaking to you is angry or happy,
regardless of whether that person is from Texas, Indiana, Idaho, or
Maine. However, some accents sound more authoritative than others;
some sound funny if you have not been exposed to them before. Along
with accent, word choice varies from region to region; this is
dialect. For robotics engineers, producing a speech synthesizer with a
credible “tone of voice” is a challenge.
Record and playback
The most primitive form of speech synthesizer is a set of tape recordings
of individual words. You have heard these in automatic telephone answering
machines and services. Most cities have a telephone number you can call
to get local time; some of these are word recordings. They all have a
characteristic choppy, interrupted sound.
There are several drawbacks to these systems. Perhaps the biggest problem
is that each word requires a separate recording, on a separate
length of tape. These tapes must be mechanically accessed, and this takes
time. A large speech vocabulary is impractical with this method.
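The record-and-playback approach can be illustrated with a minimal Python sketch of a talking clock, assuming each vocabulary word is stored as one separate recording. The names `RECORDINGS`, `words_for_time`, and `playlist`, and the `.wav` file names, are hypothetical and chosen only for illustration; they do not describe any particular machine.

```python
ONES = ["oh", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty"}

# Every word the machine can say needs its own recording.
# The file names are made up for this sketch.
VOCABULARY = ONES + list(TENS.values()) + ["the", "time", "is", "o'clock"]
RECORDINGS = {word: f"{word}.wav" for word in VOCABULARY}


def words_for_time(hour, minute):
    """Return the word sequence for a 12-hour clock announcement."""
    if not (1 <= hour <= 12 and 0 <= minute <= 59):
        raise ValueError("hour must be 1-12 and minute 0-59")
    words = ["the", "time", "is", ONES[hour]]
    if minute == 0:
        words.append("o'clock")
    elif minute < 10:
        words += ["oh", ONES[minute]]
    elif minute < 20:
        words.append(ONES[minute])
    else:
        words.append(TENS[minute // 10])
        if minute % 10:
            words.append(ONES[minute % 10])
    return words


def playlist(words):
    """Look up the recording for each word; unknown words cannot be spoken."""
    missing = [w for w in words if w not in RECORDINGS]
    if missing:
        raise KeyError(f"no recording for: {missing}")
    return [RECORDINGS[w] for w in words]


print(playlist(words_for_time(7, 45)))
```

Each announcement is a string of separately stored clips played back to back, which is why the result sounds choppy. Any word outside the fixed vocabulary cannot be spoken at all.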
Reading text
Printed text can be read by a machine using optical character recognition
(OCR), and converted into a standard digital code called ASCII (pronounced
“ASK-ee”). The ASCII can then be translated into digital speech data,
which a digital-to-analog converter (DAC) turns into voice sounds. In this
way, a machine can read text out loud. Although they are rather expensive
at the time of this writing, these machines are being used to help blind
people read printed text.
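As a small illustration of the ASCII step, the Python sketch below converts a line of recognized text into its ASCII code numbers. It assumes the OCR stage has already produced a text string; the function name `to_ascii_codes` is hypothetical, and the later conversion of those codes into sound is not shown.

```python
def to_ascii_codes(text):
    """Return the ASCII code number (0-127) of each character in text."""
    # str.encode raises UnicodeEncodeError if text holds a non-ASCII character.
    return list(text.encode("ascii"))


codes = to_ascii_codes("Robot")
print(codes)  # [82, 111, 98, 111, 116]

# The same codes written as 7-bit binary patterns, as a machine would store them.
print([format(code, "07b") for code in codes])
```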
Because there are only 128 allophones in the English language, a
machine can be designed to read almost any text. However, machines lack
a sense of which inflections are best for the different scenes that come up
in a story. With technical or scientific text, this is rarely a problem, but in
reading a story to a child, mental imagery is crucial. It is like an imaginary
movie, and it is helped along by the emotions of the reader. No machine
yet devised can paint pictures, or elicit moods, in a listener’s mind as well
as a human being can. To a human reader, the right inflections are
apparent from context. The tone of a
sentence might depend on what happened in the previous sentence,
paragraph, or chapter. Technology is a long way from giving a machine
the ability to understand, and appreciate, a good story, but nothing short
of that level of AI will produce a vivid “story movie” in a listener’s mind.
The process
There are several ways in which a machine can be programmed to produce
speech. A simplified block diagram of one process is shown in the