Page 310 - Concise Encyclopedia of Robotics

Speech Recognition
                            programmed to make sense out of that? The answer lies in the fact that,
                            whatever you say, it is composed of only a few dozen basic sounds called
                            phonemes. These phonemes can be identified by computer programs.
                              In communications, a voice can be transmitted if the bandwidth is
                            restricted to the range from 300 to 3000 Hz. Certain phonemes, such as
                            “ssss,” contain energy at frequencies of several kilohertz, but all the
                            information in a voice, including the emotional content, can be conveyed
                            if the audio passband is cut off at 3000 Hz. This is the typical voice frequency
                            response in a two-way radio.
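
As a rough illustration of the bandwidth restriction described above, the following sketch applies an idealized filter to a list of spectral components. The component list and the function name restrict_to_passband are hypothetical, and the hard cutoff at the band edges is a simplifying assumption; practical filters roll off gradually.

```python
# Idealized "brick-wall" model of the 300 to 3000 Hz voice passband.
# Real filters roll off gradually; the hard cutoff is an assumption.
VOICE_PASSBAND_HZ = (300.0, 3000.0)


def restrict_to_passband(components, passband=VOICE_PASSBAND_HZ):
    """Keep only (frequency_hz, amplitude) pairs inside the passband."""
    low_hz, high_hz = passband
    return [(f, a) for (f, a) in components if low_hz <= f <= high_hz]


# Hypothetical spectrum: low-frequency hum, two voice components,
# and the high-frequency hiss of a sibilant such as "ssss".
components = [(120.0, 0.8), (700.0, 1.0), (1800.0, 0.6), (5000.0, 0.3)]
kept = restrict_to_passband(components)
print(kept)
```

Only the components between 300 and 3000 Hz remain; the 5000 Hz sibilant energy is discarded, which matches the observation that such energy lies outside the transmitted band.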
                              Most of the acoustic energy in a human voice occurs within three
                            defined frequency ranges, called formants. The first formant is at less
                            than 1000 Hz. The second formant ranges from approximately 1600 to
                            2000 Hz. The third formant ranges from approximately 2600 to 3000 Hz.
                            Between the formants there are spectral gaps, or ranges of frequencies at
                            which little or no sound occurs. The formants, and the gaps between
                            them, stay in the same frequency ranges no matter what is said. The fine
                            details of the voice print determine not only the words, but all the
                            emotions, insinuations, and other aspects of speech. Any change in “tone of
                            voice” shows up in a voice print. Therefore, in theory, it is possible to
                            build a machine that can recognize and analyze speech as well as any
                            human being.
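
The formant ranges quoted above can be sketched as a simple band-energy measurement. The example below is a minimal illustration, not a speech analyzer: it builds a synthetic signal from three tones (500, 1800, and 2800 Hz, chosen arbitrarily to fall inside the three formant ranges), computes a naive discrete Fourier transform, and sums the power that lands in each range. The 8000 Hz sampling rate, the 400-sample window, and all function names are assumptions made for the example.

```python
import cmath
import math

SAMPLE_RATE_HZ = 8000  # assumed sampling rate
N_SAMPLES = 400        # 50 ms window, giving 20 Hz between DFT bins


def band_of(freq_hz):
    """Label a frequency using the formant ranges quoted in the text."""
    if freq_hz < 1000:
        return "first formant"
    if 1600 <= freq_hz <= 2000:
        return "second formant"
    if 2600 <= freq_hz <= 3000:
        return "third formant"
    return "gap or out of band"


def dft_power(samples):
    """Naive DFT; returns the power in bins 0 through n/2."""
    n = len(samples)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        power.append(abs(acc) ** 2)
    return power


def band_energies(samples, sample_rate_hz):
    """Sum the DFT power falling in each labeled band."""
    n = len(samples)
    energies = {}
    for k, p in enumerate(dft_power(samples)):
        label = band_of(k * sample_rate_hz / n)
        energies[label] = energies.get(label, 0.0) + p
    return energies


# Synthetic stand-in for a voice: one tone inside each formant range.
signal = [sum(math.sin(2 * math.pi * f * t / SAMPLE_RATE_HZ)
              for f in (500, 1800, 2800))
          for t in range(N_SAMPLES)]
energies = band_energies(signal, SAMPLE_RATE_HZ)
for label, energy in energies.items():
    print(f"{label}: {energy:.1f}")
```

For this synthetic input, nearly all of the power falls in the three formant bands and essentially none in the gaps between them. Real speech would show a far less tidy distribution.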
                            A/D Conversion
                            The passband, or range of audio frequencies transmitted in a circuit, can
                            be reduced greatly if you are willing to give up some of the emotional
                            content of the voice, in favor of efficient information transfer.
                            Analog-to-digital conversion accomplishes this. An analog-to-digital converter
                            (ADC) changes the continuously variable, or analog, voice signal into a series of
                            digital pulses. This is a little like the process in which a photograph is
                            converted to a grid of dots for printing in the newspaper. There are several
                            different characteristics of a pulse train that can be varied. These include
                            the pulse amplitude, the pulse duration, and the pulse frequency.
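
To make the idea of converting a continuously variable signal into discrete values concrete, here is a minimal sketch of sampling and uniform quantization. It encodes each sample as an integer code, in the style of pulse-code modulation, which is one common approach and not necessarily the scheme the text has in mind. The 8000 Hz sampling rate, the 8-bit resolution, the 1000 Hz test tone, and the function names are all assumptions made for illustration.

```python
import math


def sample_and_quantize(signal, num_samples, sample_rate_hz, bits):
    """Sample signal(t) at regular intervals and map each value,
    assumed to lie in [-1, 1], to an integer code of the given width."""
    levels = 2 ** bits
    codes = []
    for i in range(num_samples):
        value = signal(i / sample_rate_hz)
        value = max(-1.0, min(1.0, value))  # clip out-of-range input
        # Scale [-1, 1] onto [0, levels) and keep the top code in range.
        code = min(levels - 1, int((value + 1.0) / 2.0 * levels))
        codes.append(code)
    return codes


def tone(t):
    """Hypothetical analog input: a 1000 Hz sine wave."""
    return math.sin(2 * math.pi * 1000 * t)


# One full cycle of the tone: 8 samples at 8000 samples per second.
codes = sample_and_quantize(tone, 8, 8000, 8)
print(codes)
```

Like the dots in a newspaper photograph, the codes are a coarse grid laid over a smooth original: more bits or a higher sampling rate give a finer grid.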
                              A digital signal can carry a human voice within a passband less than
                            200 Hz wide. That is less than one-tenth of the passband of the analog
                            signal. The narrower the bandwidth, in general, the more of the emotional
                            content is sacrificed. Emotional content is conveyed by inflection, or
                            variation in voice tone. When inflection is lost, a voice signal resembles a
                            monotone. However, it can still carry some of the subtle meanings and feelings.
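
The "less than one-tenth" comparison can be checked with a line of arithmetic, assuming the analog passband is the 300 to 3000 Hz voice range mentioned earlier in this entry:

```latex
\[
B_{\text{analog}} = 3000\,\text{Hz} - 300\,\text{Hz} = 2700\,\text{Hz},
\qquad
\frac{B_{\text{digital}}}{B_{\text{analog}}}
  < \frac{200\,\text{Hz}}{2700\,\text{Hz}} \approx 0.074 < \frac{1}{10}
\]
```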

                            Word Analysis
                            For a computer to decipher the digital voice signal, it must have a
                            vocabulary of words or syllables, and some means of comparing this knowledge




                                                   