Mechanization of Cognition

As shown in Figure 3.5, the 100 Hz stream of sound sample vectors then proceeds to a sound feature bank. This device is based upon a collection of L fixed, 8000-dimensional floating point feature vectors K1, K2, . . . , KL (where L is typically a few tens of thousands). These feature vectors represent a variety of sound detection correlation kernels: for example, gammatone wavelets with a wide variety of frequencies, phases, and gamma envelope lengths; broadband impulse detectors; fricative detectors; etc. When a sound sample vector X arrives at the feature bank, the first step is to take the inner product of X with each of the L feature vectors, yielding L real numbers: (X · K1), (X · K2), . . . , (X · KL). These L values form the raw feature response vector. The individual components of the raw feature response vector are then each subjected to further processing (e.g., discrete time linear or quasi-linear filtering), which is customized for each of the L components. Finally, the logarithm of the square of each component of this vector is taken. The net output of the sound feature bank is an L-component non-negative primary sound symbol excitation vector S (see Figure 3.5). A new S vector is issued every 10 ms.
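The processing chain just described can be summarized in a short sketch. The Python fragment below is illustrative only: the random kernel matrix, the first-order filter coefficient alpha, and the clipping used to keep S non-negative are assumptions, since the chapter does not specify the exact per-component filters or any normalization.

```python
import numpy as np

SAMPLE_DIM = 8000   # dimension of each sound sample vector X (from the text)
L = 2000            # number of feature kernels; kept small here (text: tens of thousands)

rng = np.random.default_rng(0)
# K stands in for the fixed bank of correlation kernels (gammatone wavelets,
# broadband impulse detectors, fricative detectors, ...); random for illustration.
K = rng.standard_normal((L, SAMPLE_DIM)).astype(np.float32)

def feature_bank_step(x, prev, alpha=0.9, eps=1e-12):
    """One 10 ms step of the sound feature bank (a sketch, not the book's code).

    x     -- current 8000-dimensional sound sample vector
    prev  -- previously filtered raw responses (state of the per-component filter)
    alpha -- coefficient of a simple first-order filter, standing in for the
             per-component "discrete time linear or quasi-linear filtering"
    """
    raw = K @ x                                   # inner products (X . K1), ..., (X . KL)
    filt = alpha * prev + (1.0 - alpha) * raw     # customized per-component filtering
    S = np.maximum(np.log(filt**2 + eps), 0.0)    # log of square, clipped so S >= 0
    return S, filt

# Usage: a new S vector is issued every 10 ms (100 Hz).
state = np.zeros(L, dtype=np.float32)
x = rng.standard_normal(SAMPLE_DIM).astype(np.float32)   # one incoming sample vector
S, state = feature_bank_step(x, state)
```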
The criteria used in selecting the feature vectors are low information loss, sparse representation (a relatively small percentage of S components meaningfully above zero at any time due to any single sound source), and a low rate of individual feature response to multiple sources. By the latter it is meant that, given a typical mix of sources for the application, the probability that a feature meaningfully responding to the incoming sound stream at a particular moment is being stimulated, at that moment, by sounds from more than one source in the auditory scene is low. The net result of these properties is that S vectors tend to have few meaningfully nonzero components per source, and each sound symbol with a significant excitation is responding to only one sound source (see Sagi et al., 2001 for a concrete example of a sound feature bank).
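The sparse-representation criterion can be checked directly by measuring what fraction of S is meaningfully above zero for a single-source recording. The threshold below is a hypothetical cutoff chosen for illustration; the text does not give a numeric criterion.

```python
import numpy as np

def excited_fraction(S, threshold=1.0):
    """Fraction of components of S that are 'meaningfully above zero'.
    The threshold is a hypothetical cutoff, not a value from the text."""
    return float(np.count_nonzero(S > threshold)) / S.size

# A well-chosen feature bank keeps this fraction small for any single sound source.
```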
Figure 3.6 illustrates a typical primary sound symbol excitation vector S. This is the mechanism by which analog sound input is transduced into the world of symbols. A new S vector is created 100 times per second. S describes the content of the sound scene being monitored by the microphone at that moment. Each of the L components of S (again, L is typically tens of thousands) represents the response of one sound feature detector (as described above) to the current sonic scene.
S is composed of small, mostly disjoint (but usually not contiguous) subsets of excited sound symbol components, one subset for each sound source in the current auditory scene. Again, each excited symbol is typically responding to the sound emanating from only one of the sound sources in the audio scene being monitored by the microphone. While this single-source-per-excited-symbol rule is not strictly true all the time, it is almost always true (which, as we will see, is all that matters). Thus, if at each moment we could somehow decide which subset of excited symbols of the symbol excitation vector to pay attention to, we could ignore the other symbols and thereby focus our attention on one source. That is the essence of all initial cortical sensory processing (auditory, visual, gustatory, olfactory, and somatosensory): figuring out, in real time, which primary sensor input representation symbols to pay attention to, and ignoring the rest. This ubiquitous cognitive process is termed attended object segmentation.
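As a toy illustration of the selection step itself (not of the chapter's mechanism for discovering the grouping, which is discussed later), suppose we are handed the index subset belonging to one source; attending to that source then amounts to masking S:

```python
import numpy as np

def attend(S, attended_indices):
    """Keep only the excited symbols assigned to the attended source and
    suppress the rest. `attended_indices` is a hypothetical index set for one
    source; how such a grouping is found in real time is the hard problem the
    text calls attended object segmentation."""
    mask = np.zeros(S.shape, dtype=bool)
    mask[np.asarray(list(attended_indices), dtype=int)] = True
    return np.where(mask, S, 0.0)
```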









Figure 3.6  Illustration of the properties of a primary sound symbol excitation vector S (only a few of the L components of S are shown). Excited symbols have thicker circles. Each of the four sound sources present (at the moment illustrated) in the auditory scene being monitored is causing a relatively small subset of feature symbols to be excited. Note that the symbols excited by sources 1 and 3 are not contiguous; that is typical. Keep in mind that the number of symbols, L (which equals the number of feature vectors), is typically tens of thousands, of which only a small fraction are meaningfully excited. This is because each sound source excites only a relatively small number of sound features at each moment, and typical audio scenes contain only a relatively small number of sound sources (typically fewer than 20 monaurally distinguishable sources).