Page 94 - Biomimetics : Biologically Inspired Technologies


Bar-Cohen : Biomimetics: Biologically Inspired Technologies DK3163_c003 Final Proof page 80 21.9.2005 11:40pm


transduction process, which is necessarily different for each of these cognitive modalities. Readers
are expected to have a solid understanding of traditional speech signal processing and speech
recognition.

3.4.1 Representation of Multi-Source Soundstreams

Figure 3.5 illustrates an “audio front end” for transduction of a soundstream into a string of
“multi-symbols,” with a goal of carrying out ultra-high-accuracy speech transcription for a single speaker
embedded in multiple interfering sound sources (often including other speakers). The description of
this design does not concern itself with computational efficiency. Given a concrete design for such a
system, there are many well-known signal processing techniques for implementing approximately
the same function, often orders of magnitude more efficiently. For the purpose of this introductory
treatment (which, again, is aimed at illustrating the universality of confabulation as the
mechanization of cognition), this audio front-end design does not incorporate embellishments such as
binaural audio imaging.
Referring to Figure 3.5, the first step in processing is analog speech lowpass filtering (say, with a
flat, zero-phase-distortion response from DC to 4 kHz, with a steep rolloff thereafter) of the
high-quality (say, over 110 dB dynamic range) analog microphone input. Following lowpass filtering,
the microphone signal is sampled with an analog-to-digital converter (e.g., 24-bit) operating at a
16 kHz sample rate. The combination of high-quality analog filtering, sufficient sample rate (well
above the Nyquist rate of 8 kHz for a 4 kHz passband), and high dynamic range yields a digital output
stream with almost no artifacts (and low information loss). Note that digitizing to 24 bits supports
exploitation of the wide dynamic ranges of modern high-quality microphones. In other words, this
dynamic range will make it possible to accurately understand the speech of the attended speaker, even
if there are much higher-amplitude interferers present in the soundstream.
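The arithmetic behind these figures can be checked with a short sketch. The function names below are hypothetical, and the dynamic-range formula is the idealized ratio of full scale to one quantization step (about 6.02 dB per bit); real converters typically deliver fewer effective bits, so the headroom shown is an upper bound rather than a property of any particular device.

```python
import math

# Values taken from the text above.
PASSBAND_HZ = 4_000           # lowpass filter passes DC to 4 kHz
SAMPLE_RATE_HZ = 16_000       # ADC sample rate
ADC_BITS = 24                 # example converter word length
MIC_DYNAMIC_RANGE_DB = 110    # example microphone dynamic range


def nyquist_rate_hz(bandwidth_hz: float) -> float:
    """Minimum sample rate needed to represent a signal of this bandwidth."""
    return 2.0 * bandwidth_hz


def ideal_dynamic_range_db(bits: int) -> float:
    """Idealized ratio of full scale to one quantization step, in dB.

    Assumption: a perfect converter; this ignores noise and nonlinearity.
    """
    return 20.0 * math.log10(2 ** bits)


# How far the sample rate sits above the Nyquist rate (a factor of 2 here).
oversampling = SAMPLE_RATE_HZ / nyquist_rate_hz(PASSBAND_HZ)

# Idealized range of a 24-bit word, roughly 144.5 dB.
adc_range_db = ideal_dynamic_range_db(ADC_BITS)

# Margin between the idealized converter range and the microphone range.
headroom_db = adc_range_db - MIC_DYNAMIC_RANGE_DB

if __name__ == "__main__":
    print(f"Nyquist rate: {nyquist_rate_hz(PASSBAND_HZ):.0f} Hz")
    print(f"Oversampling factor: {oversampling:.1f}")
    print(f"Ideal {ADC_BITS}-bit range: {adc_range_db:.1f} dB")
    print(f"Headroom over microphone: {headroom_db:.1f} dB")
```

Under these idealized assumptions, a 24-bit word spans more than the 110 dB quoted for the microphone, which is consistent with the claim that digitization need not be the limiting factor.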
The 16 kHz stream of 24-bit signed integer samples generated by the above preprocessing (see
Figure 3.5) is next converted to floating point numbers and blocked up in time sequence into
8000-sample windows (8000-dimensional floating point vectors, each spanning 500 ms), at a rate of
one window for every 10 ms. Each such sound sample vector X thus overlaps the previous such vector by
98% of its length (7840 samples). In other words, each X vector contains 160 new samples that were
not in the previous X vector (and the “oldest” 160 samples in that previous vector have “dropped off
the left end”).
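The blocking step can be sketched as a sliding window over the sample stream. This is a minimal illustration, not the design in Figure 3.5: the function name `sliding_windows` is hypothetical, and the integer-to-float conversion is a plain cast because the text does not specify any scaling.

```python
from collections import deque
from typing import Iterable, Iterator, List

SAMPLE_RATE_HZ = 16_000
WINDOW_LEN = 8_000                        # samples per X vector (500 ms)
HOP_LEN = SAMPLE_RATE_HZ * 10 // 1000     # 160 new samples every 10 ms


def sliding_windows(
    samples: Iterable[int],
    window_len: int = WINDOW_LEN,
    hop_len: int = HOP_LEN,
) -> Iterator[List[float]]:
    """Yield overlapping float windows from a stream of integer samples.

    The first window is produced once window_len samples have arrived;
    after that, one window is produced for every hop_len new samples.
    """
    buf = deque(maxlen=window_len)  # oldest samples drop off the left end
    since_last = 0                  # samples received since the last window
    for s in samples:
        buf.append(float(s))        # assumption: plain cast, no rescaling
        since_last += 1
        if len(buf) == window_len and since_last >= hop_len:
            since_last = 0
            yield list(buf)


if __name__ == "__main__":
    # A synthetic ramp stands in for real ADC output.
    stream = range(WINDOW_LEN + 2 * HOP_LEN)
    windows = list(sliding_windows(stream))
    overlap = WINDOW_LEN - HOP_LEN
    print(len(windows), overlap, overlap / WINDOW_LEN)
```

A bounded `deque` is used here only because it makes the "dropped off the left end" behavior explicit; copying the buffer on every hop is simple rather than efficient, in keeping with the text's stated disregard for computational efficiency.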

Figure 3.5 An audio front-end for representation of a multi-source soundstream. See text for details.
