Page 310 - Concise Encyclopedia of Robotics
Speech Recognition
programmed to make sense out of that? The answer lies in the fact that,
whatever you say, it is composed of only a few dozen basic sounds called
phonemes. These phonemes can be identified by computer programs.
In communications, a voice can be transmitted if the bandwidth is
restricted to the range from 300 to 3000 Hz. Certain phonemes, such as
“ssss,” contain energy at frequencies of several kilohertz, but all the
information in a voice, including the emotional content, can be conveyed if the
audio passband is cut off at 3000 Hz. This is the typical voice frequency
response in a two-way radio.
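The band-limiting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 16,000-Hz sample rate, the three test tones standing in for a voice, and the helper names (dft, band_limit, tone_amplitude, make_tones) are all hypothetical, and the "brick-wall" filter built from a naive Fourier transform is a simplification of what a real radio circuit does.

```python
import cmath
import math

SAMPLE_RATE = 16000                  # Hz; assumed for this sketch
LOW_CUT, HIGH_CUT = 300.0, 3000.0    # passband quoted in the text


def dft(values, sign=-1):
    """Naive discrete Fourier transform; sign=+1 gives the unscaled inverse."""
    n = len(values)
    return [
        sum(v * cmath.exp(sign * 2j * math.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def band_limit(samples, sample_rate=SAMPLE_RATE, low=LOW_CUT, high=HIGH_CUT):
    """Zero every frequency bin outside [low, high] and return the filtered samples."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins k and n - k carry the same frequency for a real-valued signal.
        freq = min(k, n - k) * sample_rate / n
        if not low <= freq <= high:
            spectrum[k] = 0j
    return [value.real / n for value in dft(spectrum, sign=+1)]


def tone_amplitude(samples, freq, sample_rate=SAMPLE_RATE):
    """Estimate the amplitude of a sinusoid at freq by correlating against it."""
    n = len(samples)
    acc = sum(s * cmath.exp(-2j * math.pi * freq * t / sample_rate)
              for t, s in enumerate(samples))
    return 2 * abs(acc) / n


def make_tones(freqs, n=320, sample_rate=SAMPLE_RATE):
    """Sum of unit-amplitude sine tones, used here as a stand-in for a voice."""
    return [sum(math.sin(2 * math.pi * f * t / sample_rate) for f in freqs)
            for t in range(n)]


# Three test tones: one below, one inside, and one above the passband.
signal = make_tones([100, 1000, 5000])
filtered = band_limit(signal)
for f in (100, 1000, 5000):
    print(f, "Hz amplitude after filtering:", round(tone_amplitude(filtered, f), 3))
```

Of the three test tones, only the 1000-Hz tone lies inside the passband, so it is the only one that should survive the filter.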
Most of the acoustic energy in a human voice occurs within three
defined frequency ranges, called formants. The first formant is at less
than 1000 Hz. The second formant ranges from approximately 1600 to
2000 Hz. The third formant ranges from approximately 2600 to 3000 Hz.
Between the formants there are spectral gaps, or ranges of frequencies at
which little or no sound occurs. The formants, and the gaps between
them, stay in the same frequency ranges no matter what is said. The fine
details of the voice print determine not only the words, but all the
emotions, insinuations, and other aspects of speech. Any change in “tone of
voice” shows up in a voice print. Therefore, in theory, it is possible to
build a machine that can recognize and analyze speech as well as any
human being.
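As a rough illustration of the formant ranges given above, the following Python sketch measures what fraction of a signal's energy falls inside each range and what fraction falls outside all three (in the gaps or above 3000 Hz). It rests on several assumptions: the test signal is three synthetic tones rather than a real voice, the lower edge of the first formant is taken as 0 Hz because the text says only "less than 1000 Hz," and the function names are hypothetical.

```python
import cmath
import math

# Formant ranges quoted in the text, in hertz. The 0 Hz lower edge of the
# first range is an assumption; the text says only "less than 1000 Hz".
FORMANT_BANDS = {
    "first": (0.0, 1000.0),
    "second": (1600.0, 2000.0),
    "third": (2600.0, 3000.0),
}


def power_spectrum(samples, sample_rate):
    """Return (frequency, power) pairs for bins from 0 Hz up to the Nyquist frequency."""
    n = len(samples)
    pairs = []
    for k in range(n // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples))
        pairs.append((k * sample_rate / n, abs(coeff) ** 2))
    return pairs


def formant_energy_fractions(samples, sample_rate):
    """Fraction of total energy in each formant range, plus everything outside them."""
    spectrum = power_spectrum(samples, sample_rate)
    total = sum(power for _, power in spectrum)
    fractions = {name: 0.0 for name in FORMANT_BANDS}
    fractions["outside"] = 0.0
    if total == 0:
        return fractions  # silent input: nothing to apportion
    for freq, power in spectrum:
        for name, (low, high) in FORMANT_BANDS.items():
            if low <= freq <= high:
                fractions[name] += power / total
                break
        else:
            # No formant range matched, so this bin lies in a gap or above 3000 Hz.
            fractions["outside"] += power / total
    return fractions


# Synthetic stand-in for a voice: one tone inside each formant range.
sample_rate = 16000   # Hz; assumed
tones = [(500, 1.0), (1800, 0.5), (2800, 0.25)]   # (frequency in Hz, amplitude)
samples = [sum(a * math.sin(2 * math.pi * f * t / sample_rate) for f, a in tones)
           for t in range(320)]
fractions = formant_energy_fractions(samples, sample_rate)
for name, share in fractions.items():
    print(name, round(share, 3))
```

Run on a recorded voice instead of these tones, the same bookkeeping would give a crude check of how much energy really falls between the formants.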
A/D Conversion
The passband, or range of audio frequencies transmitted in a circuit, can
be reduced greatly if you are willing to give up some of the emotional
content of the voice, in favor of efficient information transfer.
Analog-to-digital conversion accomplishes this. An analog-to-digital converter (ADC)
changes the continuously variable, or analog, voice signal into a series of
digital pulses. This is a little like the process in which a photograph is
converted to a grid of dots for printing in the newspaper. There are several
different characteristics of a pulse train that can be varied. These include
the pulse amplitude, the pulse duration, and the pulse frequency.
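One way to picture the conversion is the Python sketch below, which turns sampled voltages into fixed-length binary codes, with each 1 standing for a pulse and each 0 for its absence. This is a pulse-code approach chosen for illustration; it is not necessarily the scheme the text has in mind, which mentions varying pulse amplitude, duration, or frequency. The 8-bit resolution, the input range of -1 to +1, and the function names are assumptions.

```python
import math


def quantize(sample, bits=8):
    """Map an analog value in [-1, 1] to an integer code from 0 to 2**bits - 1."""
    levels = 2 ** bits
    clipped = max(-1.0, min(1.0, sample))     # out-of-range inputs are clipped
    code = int((clipped + 1.0) / 2.0 * levels)
    return min(levels - 1, code)              # keep +1.0 inside the top level


def dequantize(code, bits=8):
    """Return the analog value at the center of the given quantization level."""
    levels = 2 ** bits
    return (code + 0.5) / levels * 2.0 - 1.0


def to_pulses(code, bits=8):
    """Render a code as a string of bits: '1' is a pulse, '0' is no pulse."""
    return format(code, "0{}b".format(bits))


# Sample one cycle of a sine wave at 16 points and digitize each sample.
analog = [math.sin(2 * math.pi * t / 16) for t in range(16)]
codes = [quantize(s) for s in analog]
for s, c in zip(analog, codes):
    print("{:+.3f} -> {} -> {:+.3f}".format(s, to_pulses(c), dequantize(c)))
```

With 8 bits there are 256 levels, so each reconstructed value differs from the original by no more than half a level, or 1/256 of the unit amplitude. Using fewer bits coarsens the grid, much as a coarser dot screen degrades a newspaper photograph.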
A digital signal can carry a human voice within a passband less than
200 Hz wide. That is less than one-tenth of the passband of the analog
signal. The narrower the bandwidth, in general, the more of the emotional
content is sacrificed. Emotional content is conveyed by inflection, or
variation in voice tone. When inflection is lost, a voice signal resembles a
monotone. However, it can still carry some of the subtle meanings and feelings.
Word Analysis
For a computer to decipher the digital voice signal, it must have a
vocabulary of words or syllables, and some means of comparing this knowledge