generated by a set of lexicons (in frontal cortex) that specialize in storing and recalling action
symbol sequences.
A common objection to this kind of system is that it works only as long as its expectations keep being met: if even one glitch occurs, it looks as though the whole process will fall apart and stop working, and will then somehow have to be restarted (which is not easy; for example, it may require the listener to somehow obtain enough signal-to-noise ratio for a much cruder trick to work). Well, this objection is quite wrong. Even if the next word, and the word after that, are not among the expected ones, this architecture will often recover and ongoing speech-stream word recognition will continue, as we demonstrated with our crude initial version (Sagi et al., 2001). A problem that can reliably make this architecture fail is a sudden major change in the pace of delivery, or a significant brief interruption of delivery. For example, if the speaker suddenly starts speaking much faster or much slower, the previously mentioned subsystem that monitors and sets the pace of the architecture's operation will cause the timing of the consensus building and word-boundary segmentation to be too far off. Another problem arises if the speaker gets momentarily tongue-tied and inserts a small unexpected sequence of sounds into a word (try this yourself by smoothly inserting the brief meaningless sound "BRYKA" in the middle of a word at a cocktail party; the listener's Figure 3.7 architecture will fail, and they will be forced to move closer, to get clean recognitions, to get it going again).
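To make the recovery claim concrete, the toy sketch below shows one way an expectation-driven recognizer can absorb an unexpected word and keep going. It is not the Figure 3.7 architecture; the bigram-style expectation table and the fallback rule are invented purely for illustration:

# Toy sketch of expectation-driven word recognition with recovery.
# NOT the Figure 3.7 architecture: the expectation table and the
# fallback rule below are illustrative assumptions only.

EXPECTATIONS = {                 # hypothetical "what usually follows what" links
    "<start>": {"the", "a"},
    "the": {"cat", "dog"},
    "cat": {"sat", "ran"},
    "sat": {"down", "on"},
}

def recognize(acoustic_stream, expectations=EXPECTATIONS):
    """acoustic_stream: one list of candidate words per frame, best first."""
    recognized, context = [], "<start>"
    for candidates in acoustic_stream:
        expected = expectations.get(context, set())
        # Prefer a candidate that is also expected from the current context ...
        match = next((w for w in candidates if w in expected), None)
        if match is None:
            # ... but a single glitch does not stop the process: accept the
            # best acoustic candidate and re-seed expectations from it.
            match = candidates[0]
        recognized.append(match)
        context = match
    return recognized

# One glitchy frame ("brikka") is absorbed and recognition continues.
stream = [["the"], ["cat"], ["brikka"], ["sat"], ["down"]]
print(recognize(stream))         # ['the', 'cat', 'brikka', 'sat', 'down']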
A strong tradition in speech recognition technology is an insistence that speech recognizers be "time-warp insensitive" (i.e., insensitive to changes in the pace of word delivery). Well, the Figure 3.7 architecture certainly is not strongly "time-warp insensitive," and as pointed out immediately above, neither are humans! However, modest levels of time warp have no impact, since they merely shift the phrase region (slightly left or right of its nominal position) in which a particular phrase gets detected. Also note that since honed phrase expectations are transferred, it is not necessary for all of the primary sound symbols of a phrase to be present in order for that phrase to contribute significantly to the "promotion" of the next-word acoustic lexicon symbols that receive links from it. Thus, many primary symbols can be missed with no effect on correct word recognition. This is one of the things that happens when we speak more quickly: some intermediate sounds are left out. For example, say "Worcestershire sauce" at different speeds, from slow to fast, and consider the changes in the sounds you issue.
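The small sketch below illustrates, with invented link weights, why dropping some primary symbols lowers a candidate's support without changing which candidate wins the "promotion." It illustrates the general idea only; the symbols and weights are not the actual lexicon links:

# Minimal sketch of "promotion" of next-word candidates by the primary
# symbols of the current phrase. The link weights are invented; the point
# is only that dropping some symbols lowers the winning candidate's score
# without changing which candidate wins.

LINKS = {    # hypothetical knowledge links: phrase symbol -> {candidate: weight}
    "wor":   {"sauce": 3, "salad": 1},
    "ces":   {"sauce": 2, "source": 1},
    "ter":   {"sauce": 2, "water": 1},
    "shire": {"sauce": 3, "sheer": 1},
}

def promote(present_symbols, links=LINKS):
    """Sum the support each candidate receives from the symbols actually heard."""
    scores = {}
    for sym in present_symbols:
        for cand, w in links.get(sym, {}).items():
            scores[cand] = scores.get(cand, 0) + w
    return max(scores, key=scores.get), scores

# Full, slow delivery versus fast speech with two primary symbols dropped:
print(promote(["wor", "ces", "ter", "shire"]))   # ('sauce', ...), score 10
print(promote(["wor", "shire"]))                 # still ('sauce', ...), score 6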
3.4.3 Discussion
This section has outlined how sound input can be transduced into a symbol stream (actually, an
expectation stream) and how that stream can, through a consensus building process, be interpreted
as a sequence of words being emitted by an attended speaker.
One of the many Achilles’ heels of past speech transcription systems has been the use of a
vector quantizer in the sound-processing front end. This is a device that is roughly the same as the
sound feature bank described in this section, except that its output is one and only one symbol
at each time step (10 ms). This makes it impossible for such systems to deal with multi-source
audio scenes.
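A toy numerical contrast may help show why a hard vector quantizer discards information that a graded expectation output preserves. The two-symbol codebook and the mixed frame below are made-up numbers, purely for illustration, and do not describe the sound feature bank of this section:

import numpy as np

# Hard vector quantizer (exactly one symbol per 10-ms frame) versus a
# graded "expectation" output, on a made-up two-prototype codebook.

CODEBOOK = {                      # hypothetical sound-feature prototypes
    "A": np.array([1.0, 0.0]),
    "B": np.array([0.0, 1.0]),
}

def vector_quantize(frame):
    """Hard VQ: the nearest prototype wins; every other symbol is discarded."""
    return min(CODEBOOK, key=lambda s: np.linalg.norm(frame - CODEBOOK[s]))

def expectation(frame, temp=0.5):
    """Soft output: a degree of belief for every symbol in the codebook."""
    weights = {s: np.exp(-np.linalg.norm(frame - p) / temp)
               for s, p in CODEBOOK.items()}
    total = sum(weights.values())
    return {s: round(float(w / total), 2) for s, w in weights.items()}

# A single frame containing energy from two simultaneous sources:
mixed = 0.8 * CODEBOOK["A"] + 0.6 * CODEBOOK["B"]
print(vector_quantize(mixed))     # one symbol only; the weaker source is lost
print(expectation(mixed))         # e.g. {'A': 0.63, 'B': 0.37}; both stay in play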
The sound-processing design described in this section also overcomes the inability of past speech recognition systems to exploit long-range context. Even the best of today's speech recognizers, operating in a totally noise-free environment with a highly cooperative speaker, cannot achieve much better than 96% sustained accuracy with vocabularies over 60,000 words. This is primarily because they lack a way to exploit long-range context from previous words in the current sentence and from previous sentences. In contrast, the system described here has full access to the context-exploitation methods discussed in Section 3.3, which can be extended to arbitrarily large bodies of context.
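As a loose illustration of what long-range context buys (and not as a description of the Section 3.3 machinery), the sketch below lets every available context word, including words from an earlier sentence, contribute support to each acoustically plausible candidate; the association table is invented:

# Toy illustration of exploiting long-range context to disambiguate an
# acoustically ambiguous word. The association table is invented and merely
# stands in for the kind of knowledge links discussed in Section 3.3.

ASSOC = {    # hypothetical context word -> {candidate: association strength}
    "fishing":  {"bass": 2.0, "base": 0.5},
    "lake":     {"bass": 1.5, "base": 0.5},
    "military": {"base": 2.0, "bass": 0.1},
}

def rescore(candidates, context_words, assoc=ASSOC):
    """Let every available context word, however far back, add its support."""
    scores = {c: 0.0 for c in candidates}
    for word in context_words:
        for cand, strength in assoc.get(word, {}).items():
            if cand in scores:
                scores[cand] += strength
    return max(scores, key=scores.get), scores

# Context drawn from a previous sentence ("We went fishing at the lake ...")
# resolves the acoustically ambiguous bass/base in the current sentence.
print(rescore(["bass", "base"], ["fishing", "lake"]))
# ('bass', {'bass': 3.5, 'base': 1.0})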
Building a speech recognizer for colloquial speech is much more difficult than for proper
language. As is well known, children essentially cannot learn to understand speech unless they