Page 94 - Biomimetics : Biologically Inspired Technologies

Bar-Cohen : Biomimetics: Biologically Inspired Technologies DK3163_c003 Final Proof page 80 21.9.2005 11:40pm





                    transduction process, which is necessarily different for each of these cognitive modalities. Readers
                    are expected to have a solid understanding of traditional speech signal processing and speech
                    recognition.

                    3.4.1 Representation of Multi-Source Soundstreams

                    Figure 3.5 illustrates an “audio front end” for transduction of a soundstream into a string of
                    “multi-symbols,” with the goal of carrying out ultra-high-accuracy speech transcription for a single
                    speaker embedded in multiple interfering sound sources (often including other speakers). The description of
                    this design does not concern itself with computational efficiency. Given a concrete design for such a
                    system, there are many well-known signal processing techniques for implementing approximately
                    the same function, often orders of magnitude more efficiently. For the purpose of this introductory
                    treatment (which, again, is aimed at illustrating the universality of confabulation as the mechan-
                    ization of cognition), this audio front-end design does not incorporate embellishments such as
                    binaural audio imaging.
                       Referring to Figure 3.5, the first step in processing is analog speech lowpass filtering (say, with a
                    flat, zero-phase-distortion response from DC to 4 kHz, with a steep rolloff thereafter) of the
                    high-quality (say, over 110 dB dynamic range) analog microphone input. Following lowpass filtering,
                    the microphone signal is sampled with an (e.g., 24-bit) analog-to-digital converter operating at a
                    16 kHz sample rate. The combination of high-quality analog filtering, sufficient sample rate (well
                    above the Nyquist rate of 8 kHz), and high dynamic range yields a digital output stream with almost
                    no artifacts (and low information loss). Note that digitizing to 24 bits supports exploitation of the
                    wide dynamic ranges of modern high-quality microphones. In other words, this dynamic range will
                    make it possible to accurately understand the speech of the attended speaker, even if there are much
                    higher amplitude interferers present in the soundstream.
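
                    The arithmetic behind these front-end parameters can be checked with a minimal sketch. The function
                    and constant names below are illustrative assumptions, not part of the original design, and the
                    dynamic-range figure is the ideal value for an N-bit quantizer (20 log10 of 2^N), not a measured
                    converter specification.

```python
import math

# Values quoted in the text; the names themselves are hypothetical.
BANDWIDTH_HZ = 4_000          # lowpass passband edge (DC to 4 kHz)
SAMPLE_RATE_HZ = 16_000       # A/D converter sample rate
ADC_BITS = 24                 # A/D converter word length
MIC_DYNAMIC_RANGE_DB = 110.0  # "over 110 dB" microphone dynamic range


def nyquist_rate_hz(bandwidth_hz):
    """Minimum sample rate needed to represent a signal band-limited to bandwidth_hz."""
    return 2.0 * bandwidth_hz


def ideal_dynamic_range_db(bits):
    """Ideal ratio of full scale to one quantization step, in dB, for a given word length."""
    return 20.0 * math.log10(2 ** bits)


# How far the chosen sample rate sits above the Nyquist rate.
oversampling_factor = SAMPLE_RATE_HZ / nyquist_rate_hz(BANDWIDTH_HZ)

# Margin between the ideal converter range and the microphone range.
headroom_db = ideal_dynamic_range_db(ADC_BITS) - MIC_DYNAMIC_RANGE_DB

if __name__ == "__main__":
    print(f"Nyquist rate: {nyquist_rate_hz(BANDWIDTH_HZ):.0f} Hz")
    print(f"Oversampling factor: {oversampling_factor:.1f}x")
    print(f"Ideal 24-bit dynamic range: {ideal_dynamic_range_db(ADC_BITS):.2f} dB")
    print(f"Headroom over microphone: {headroom_db:.2f} dB")
```

                    Under these assumptions, the ideal 24-bit range exceeds the quoted microphone range by a comfortable
                    margin, which is consistent with the text's claim that 24-bit digitizing can exploit the
                    microphone's full dynamic range.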
                       The 16 kHz stream of 24-bit signed integer samples generated by the above preprocessing (see
                    Figure 3.5) is next converted to floating point numbers and blocked up in time sequence into
                    8000-sample windows (8000-dimensional floating point vectors, each spanning 500 ms of sound), at a
                    rate of one window every 10 ms. Each such sound sample vector X thus overlaps the previous such
                    vector by 98% of its length (7840 samples). In other words, each X vector contains 160 new samples
                    that were not in the previous X vector (and the “oldest” 160 samples in that previous vector have
                    “dropped off the left end”).
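
                    The blocking step just described can be sketched as follows. This is an illustrative sketch only:
                    the helper names, the scaling of 24-bit integers to the range [-1, 1), and the synthetic ramp input
                    are assumptions made for the example, and (as the text notes) a practical system would use a far
                    more efficient implementation.

```python
SAMPLE_RATE_HZ = 16_000
WINDOW_LEN = 8000                       # samples per X vector
HOP = SAMPLE_RATE_HZ * 10 // 1000       # new samples per 10 ms step (160)


def int24_to_float(sample):
    """Scale a signed 24-bit integer sample to a float in [-1.0, 1.0) (assumed convention)."""
    return sample / float(1 << 23)


def sliding_windows(samples, window_len=WINDOW_LEN, hop=HOP):
    """Yield successive overlapping windows, advancing by `hop` samples each time."""
    for start in range(0, len(samples) - window_len + 1, hop):
        yield samples[start:start + window_len]


# Synthetic stand-in for the A/D output: a ramp long enough for three windows.
stream = [int24_to_float(n) for n in range(WINDOW_LEN + 2 * HOP)]
windows = list(sliding_windows(stream))

overlap_samples = WINDOW_LEN - HOP              # samples shared by consecutive windows
overlap_fraction = overlap_samples / WINDOW_LEN

if __name__ == "__main__":
    print(f"{len(windows)} windows of {WINDOW_LEN} samples")
    print(f"overlap: {overlap_samples} samples ({overlap_fraction:.0%})")
```

                    In this sketch, each window shares its first 7840 samples with the last 7840 samples of the window
                    before it, and only the final 160 samples are new.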

                    Figure 3.5  An audio front-end for representation of a multi-source soundstream. See text for details.