Page 312 - Concise Encyclopedia of Robotics
(www.google.com) or a similar search engine. Related entries include
BANDWIDTH, CONTEXT, DATA CONVERSION, DIGITAL SIGNAL PROCESSING, MESSAGE PASSING,
OPTICAL CHARACTER RECOGNITION, PROSODIC FEATURES, SOUND TRANSDUCER, SPEECH SYNTHESIS,
and SYNTAX.
SPEECH SYNTHESIS
Speech synthesis, also called voice synthesis, is the electronic generation of
sounds that mimic the human voice. These sounds can be generated from
digital text or from printed documents. Speech can also be generated by
high-level computers that have artificial intelligence (AI), in the form of
responses to stimuli or input from humans or other machines.
What is a voice?
All audible sounds consist of combinations of acoustic waves within the
frequency range from 20 Hz to 20 kHz. (A frequency of 1 Hz is one cycle
per second; 1 kHz = 1000 Hz.) These waves take the form of vibrations in
air molecules. The patterns of vibration can be duplicated as
alternating electric currents (AC).
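The idea that a sound is a combination of waves at particular frequencies can be sketched in a few lines of Python. This is only an illustration: the sample rate of 8000 samples per second and the helper names tone and mix are assumptions made for the example, not part of any particular synthesizer.

```python
import math

SAMPLE_RATE = 8000  # samples per second; an assumed value for this sketch


def tone(freq_hz, duration_s, amplitude=1.0, sample_rate=SAMPLE_RATE):
    """Return samples of a pure sine tone as a list of floats."""
    count = int(round(duration_s * sample_rate))
    step = 2.0 * math.pi * freq_hz / sample_rate  # phase advance per sample
    return [amplitude * math.sin(step * i) for i in range(count)]


def mix(*waves):
    """Add several equal-length waves sample by sample."""
    return [sum(values) for values in zip(*waves)]


# A 1 kHz tone completes one cycle every 8 samples at 8000 samples per second.
one_khz = tone(1000, 0.01)

# Two tones of different frequency and amplitude combined into one waveform.
combined = mix(tone(300, 0.01, 0.5), tone(1000, 0.01, 0.3))
```

Each list of numbers stands in for the pattern of vibration; sent through a digital-to-analog converter and an amplifier, such a list would become a varying electric current.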
A frequency band of 300 to 3000 Hz is wide enough to convey all the
information, and also all of the emotional content, in any person’s voice.
Therefore, speech synthesizers only need to make sounds within the
range from 300 to 3000 Hz. The challenge is to produce waves at exactly
the right frequencies, at the right times, and in the right phase
combinations. The modulation must also be correct, so the intended meaning is
conveyed. In the human voice, the volume and frequency rise and fall in
subtle and precise ways. The slightest change in modulation can make a
tremendous difference in the meaning of what is said. You can tell, even
over the telephone, whether the speaker is anxious, angry, or relaxed. A
request sounds different from a command. A question sounds different
from a declarative statement, even if the words are the same.
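The effect of a rising or falling frequency can be sketched with a tone whose pitch glides from one value to another. The following is a minimal illustration, not a model of real speech: the names in_voice_band and glide, the 8000 samples-per-second rate, and the 400 Hz and 600 Hz endpoints are all assumptions chosen for the example. The 300 to 3000 Hz limits are the ones quoted above.

```python
import math

VOICE_BAND_HZ = (300.0, 3000.0)  # band quoted in the text
SAMPLE_RATE = 8000               # assumed; more than twice the 3000 Hz upper limit


def in_voice_band(freq_hz, band=VOICE_BAND_HZ):
    """Return True if freq_hz lies inside the voice band."""
    low, high = band
    return low <= freq_hz <= high


def glide(start_hz, end_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return a tone whose frequency moves linearly from start_hz to end_hz."""
    if not (in_voice_band(start_hz) and in_voice_band(end_hz)):
        raise ValueError("frequency outside the 300-3000 Hz band")
    count = int(round(duration_s * sample_rate))
    samples = []
    phase = 0.0
    for i in range(count):
        fraction = i / max(count - 1, 1)
        freq = start_hz + (end_hz - start_hz) * fraction
        samples.append(math.sin(phase))
        # Accumulate phase so the wave stays continuous as the frequency changes.
        phase += 2.0 * math.pi * freq / sample_rate
    return samples


rising = glide(400, 600, 0.25)   # pitch rises, loosely like a question
falling = glide(600, 400, 0.25)  # pitch falls, loosely like a statement
```

The two lists contain the same range of frequencies, but in opposite order. A real synthesizer would vary volume as well as frequency, and would do so over many components at once.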
Tone of voice
In the English language there are about 40 elementary sounds, known as
phonemes. Some languages have more phonemes than English; others have
fewer. The exact sound of a phoneme
can vary, depending on what comes before and after it. These variations
are called allophones. There are 128 allophones in English. These can be
strung together in myriad ways.
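The dependence of an allophone on its neighbors can be sketched as a small lookup rule. The example below handles only the phoneme /t/, using three well-known variants in many American accents. The function names, the labels such as t_flap, and the use of plain letters in place of true phoneme symbols are all simplifying assumptions; a practical synthesizer would need a far larger rule set.

```python
# Simplified placeholder symbols, not a real phoneme inventory.
VOWELS = {"a", "e", "i", "o", "u"}


def allophone_of_t(previous, following):
    """Pick an allophone label for /t/ from its neighbors (None = word boundary)."""
    if previous == "s":
        return "t_unaspirated"   # as in "stop"
    if previous in VOWELS and following in VOWELS:
        return "t_flap"          # as in "butter" in many American accents
    if previous is None:
        return "t_aspirated"     # as in "top"
    return "t_plain"             # fallback for contexts this sketch ignores


def realize(phonemes):
    """Replace each /t/ in a phoneme list with a context-dependent label."""
    out = []
    for i, p in enumerate(phonemes):
        if p != "t":
            out.append(p)
            continue
        prev = phonemes[i - 1] if i > 0 else None
        nxt = phonemes[i + 1] if i + 1 < len(phonemes) else None
        out.append(allophone_of_t(prev, nxt))
    return out
```

For instance, realize(["s", "t", "o", "p"]) labels the /t/ as t_unaspirated, while realize(["t", "o", "p"]) labels it t_aspirated, even though both words contain the same phoneme.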
The inflection, or “tone of voice,” is another variable in speech; it
depends on whether the speaker is angry, sad, scared, happy, or indifferent.
It depends not only on the actual feelings of the speaker, but also on age,
gender, upbringing, and other factors. A voice can also have an accent.