
                    generated by a set of lexicons (in frontal cortex) that specialize in storing and recalling action
                    symbol sequences.
A common objection to this kind of system is that it appears to work only as long as the expectations keep being met: if even one glitch occurs, the whole process seems likely to fall apart and stop, and then somehow to need restarting (which is not easy; for example, it may require the listener to obtain enough signal-to-noise ratio for a much cruder trick to work). This objection is quite wrong. Even if the next word, and the word after that, are not among the expected ones, this architecture will often recover and ongoing speech-stream word recognition will continue, as we demonstrated with our crude initial version (Sagi et al., 2001). A problem that can reliably make this architecture fail is a sudden major change in the pace of delivery, or a significant brief interruption of delivery. For example, if the speaker suddenly starts speaking much faster or much slower, the subsystem mentioned earlier that monitors and sets the pace of the architecture's operation will cause the timing of the consensus building and word-boundary segmentation to be too far off. Another problem arises if the speaker gets momentarily tongue-tied and inserts a small unexpected sequence of sounds into a word (try this yourself by smoothly inserting the brief meaningless sound "BRYKA" in the middle of a word at a cocktail party: the listener's Figure 3.7 architecture will fail, and they will be forced to move closer, to get clean recognitions, before it starts working again).
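
To make the recovery behavior concrete, the following Python sketch may help. It is entirely illustrative: the lexicon, scoring function, and threshold are assumptions for the example, not the Sagi et al. (2001) implementation. It shows how an expectation-driven recognizer can fall back to a full-lexicon search when an unexpected word arrives, then resume with fresh expectations, so a single glitch does not derail the stream.

THRESHOLD = 0.5  # illustrative acceptance score, not from the chapter

def match_score(word, symbols):
    """Fraction of a word's sound symbols present in the input group
    (a stand-in for the architecture's consensus-building match)."""
    return sum(s in symbols for s in word) / len(word)

def recognize_stream(symbol_groups, lexicon, expectations):
    """Consume one symbol group per candidate word; prefer expected words,
    but on a glitch fall back to the full lexicon instead of halting."""
    words = []
    for symbols in symbol_groups:
        pool = expectations.get(words[-1], lexicon) if words else lexicon
        best = max(pool, key=lambda w: match_score(w, symbols))
        if match_score(best, symbols) < THRESHOLD:
            # Unexpected word: widen the search rather than stopping.
            best = max(lexicon, key=lambda w: match_score(w, symbols))
        words.append(best)  # the next iteration re-hones expectations from here
    return words

lexicon = ["red", "wine", "glass", "dog"]
expectations = {"red": ["wine"], "wine": ["glass"]}
print(recognize_stream([set("rd"), set("dg"), set("gls")], lexicon, expectations))
# -> ['red', 'dog', 'glass']: the unexpected 'dog' is recovered via fallback,
#    and recognition of the ongoing stream continues with 'glass'.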
A strong tradition in speech recognition technology is an insistence that speech recognizers be "time-warp insensitive" (i.e., insensitive to changes in the pace of word delivery). Well, the Figure 3.7 architecture certainly is not strongly "time-warp insensitive," and, as pointed out immediately above, neither are humans! However, modest levels of time warp have no impact, since such warping merely shifts the phrase region where a particular phrase gets detected slightly left or right of its nominal position. Also note that because honed phrase expectations are transferred, it is not necessary for all of the primary sound symbols of a phrase to be present in order for that phrase to contribute significantly to the "promotion" of the next-word acoustic lexicon symbols that receive links from it. Thus, many primary symbols can be missed with no effect on correct word recognition. This is one of the things that happens when we speak more quickly: some intermediate sounds are left out. For example, say "Worcestershire sauce" at different speeds from slow to fast and note the changes in the sounds you produce.
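
The partial-match promotion just described might be sketched as follows. This is a hypothetical illustration: the symbol inventory and the proportional weighting are assumptions, not the chapter's actual linkage scheme.

def phrase_promotion(phrase_symbols, observed, linked_words, weight=1.0):
    """Return promotion scores for the words linked to a phrase,
    scaled by how much of the phrase was actually heard."""
    seen = sum(s in observed for s in phrase_symbols) / len(phrase_symbols)
    # Even a partially heard phrase still passes a significant boost
    # to its linked next-word acoustic lexicon symbols.
    return {word: weight * seen for word in linked_words}

# "Worcestershire" spoken quickly drops intermediate sounds, yet the
# surviving symbols still promote "sauce" in the acoustic lexicon.
full = ["W", "UH", "S", "T", "ER", "SH", "ER"]
fast = {"W", "S", "T", "SH"}                    # intermediate sounds dropped
print(phrase_promotion(full, fast, ["sauce"]))  # -> {'sauce': ~0.57}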

                    3.4.3 Discussion

                    This section has outlined how sound input can be transduced into a symbol stream (actually, an
                    expectation stream) and how that stream can, through a consensus building process, be interpreted
                    as a sequence of words being emitted by an attended speaker.
                       One of the many Achilles’ heels of past speech transcription systems has been the use of a
                    vector quantizer in the sound-processing front end. This is a device that is roughly the same as the
                    sound feature bank described in this section, except that its output is one and only one symbol
                    at each time step (10 ms). This makes it impossible for such systems to deal with multi-source
                    audio scenes.
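
The contrast between the two front ends can be sketched schematically. The codebook, detectors, and threshold below are illustrative assumptions, not the actual front-end design.

import numpy as np

def vector_quantizer(frame, codebook):
    """Classic VQ front end: exactly one symbol per 10 ms frame, so two
    simultaneous sound sources are forced into a single code."""
    distances = np.linalg.norm(codebook - frame, axis=1)
    return int(np.argmin(distances))            # one and only one symbol

def feature_bank(frame, detectors, threshold=0.5):
    """Feature-bank front end: every detector whose response clears the
    threshold emits a symbol, so overlapping sources can each contribute
    symbols to the expectation stream."""
    responses = detectors @ frame
    return [i for i, r in enumerate(responses) if r > threshold]

rng = np.random.default_rng(0)
frame = rng.normal(size=8)                      # one 10 ms spectral frame
codebook = rng.normal(size=(16, 8))             # 16 VQ codewords
detectors = rng.normal(size=(16, 8))            # 16 feature detectors
print(vector_quantizer(frame, codebook))        # a single symbol index
print(feature_bank(frame, detectors))           # possibly several symbols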
The sound processing design described in this section also overcomes the inability of past speech recognition systems to exploit long-range context. Even the best of today's speech recognizers, operating in a totally noise-free environment with a highly cooperative speaker, cannot achieve much better than 96% sustained accuracy with vocabularies over 60,000 words. This is primarily because they lack a way to exploit long-range context from previous words in the current sentence and from previous sentences. In contrast, the system described here has full access to the context-exploitation methods discussed in Section 3.3, which can be extended to arbitrarily large bodies of context.
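
As a rough illustration of why long-range context matters, consider the sketch below. The scoring functions and weighting are assumed for the example and are not drawn from Section 3.3.

def rank_candidates(candidates, acoustic, local_ngram, long_range, alpha=0.5):
    """Combine an acoustic match with both short- and long-range context.
    A conventional recognizer uses only acoustic * local_ngram; adding
    long_range lets earlier sentences disambiguate similar-sounding words."""
    def score(word):
        context = (1 - alpha) * local_ngram(word) + alpha * long_range(word)
        return acoustic(word) * context
    return sorted(candidates, key=score, reverse=True)

# "speech" vs "beach": nearly identical acoustics in a noisy phrase, but a
# preceding sentence about software tips the long-range context score.
acoustic   = {"speech": 0.50, "beach": 0.50}.get
local      = {"speech": 0.40, "beach": 0.45}.get
long_range = {"speech": 0.90, "beach": 0.10}.get
print(rank_candidates(["speech", "beach"], acoustic, local, long_range))
# -> ['speech', 'beach']: long-range context overrides the local tie.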
Building a speech recognizer for colloquial speech is much more difficult than building one for proper language. As is well known, children essentially cannot learn to understand speech unless they