Page 95 - Biomimetics : Biologically Inspired Technologies
Bar-Cohen : Biomimetics: Biologically Inspired Technologies DK3163_c003 Final Proof page 81 21.9.2005 11:40pm
Mechanization of Cognition 81
As shown in Figure 3.5, the 100 Hz stream of sound sample vectors then proceeds to a sound
feature bank. This device is based upon a collection of L fixed, 8000-dimensional floating point
feature vectors: $K_1, K_2, \ldots, K_L$ (where L is typically a few tens of thousands). These feature
vectors represent a variety of sound detection correlation kernels, for example: gammatone
wavelets with a wide variety of frequencies, phases, and gamma envelope lengths; broadband
impulse detectors; fricative detectors; etc. When a sound sample vector X arrives at the feature bank,
the first step is to take the inner product of X with each of the L feature vectors, yielding L real
numbers: $(X \cdot K_1), (X \cdot K_2), \ldots, (X \cdot K_L)$. These L values form the raw feature response vector. The
individual components of the raw feature response vector are then each subjected to further
processing (e.g., discrete time linear or quasi-linear filtering), which is customized for each of
the L components. Finally, the logarithm of the square of each component of this vector is taken.
The net output of the sound feature bank is an L-component non-negative primary sound symbol
excitation vector S (see Figure 3.5). A new S vector is issued every 10 ms.
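The three steps just described (inner products, per-component filtering, logarithm of the square) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used in the text: the function name `sound_feature_bank`, the one-pole smoothing filter standing in for the "customized" per-component filtering, the small constant `eps`, and the clamp at zero (used here so that the output is non-negative, as the text says S is) are all hypothetical choices.

```python
import numpy as np

def sound_feature_bank(X, K, filter_coeffs, prev_state, eps=1e-12):
    """One 10 ms step of a toy sound feature bank (illustrative only).

    X             : (D,) sound sample vector
    K             : (L, D) feature vectors K_1..K_L, one per row
    filter_coeffs : (L,) per-feature smoothing coefficients in [0, 1)
                    (hypothetical stand-in for the customized filtering)
    prev_state    : (L,) filter state carried over from the previous step
    Returns (S, new_state).
    """
    # Step 1: raw feature response vector, the L inner products (X . K_i).
    raw = K @ X

    # Step 2: per-component discrete-time linear filtering.
    # Assumption: a simple one-pole low-pass filter, one coefficient per feature.
    state = filter_coeffs * prev_state + (1.0 - filter_coeffs) * raw

    # Step 3: logarithm of the square of each component.
    # eps avoids log(0); it is not part of the original description.
    log_energy = np.log(state ** 2 + eps)

    # Assumption: clamp negative values to zero so that S is non-negative.
    S = np.maximum(log_energy, 0.0)
    return S, state
```

In a streaming setting the returned `state` would be passed back in as `prev_state` on the next 10 ms step, so each call yields one new S vector.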
The criteria used in selecting the feature vectors are low information loss, sparse representation
(a relatively small percentage of S components meaningfully above zero at any time due to
any single sound source), and a low rate of individual feature response to multiple sources. The
last criterion means that, given a typical application mix of sources, there is a low probability that any
feature meaningfully responding to the incoming sound stream at a particular time is being stimulated
(at that moment) by sounds from more than one source in the auditory scene. The net
result of these properties is that S vectors tend to have few meaningfully nonzero components per
source, and each sound symbol with a significant excitation is responding to only one sound source
(see Sagi et al., 2001 for a concrete example of a sound feature bank).
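Two of these criteria, sparse representation and a low rate of multi-source response, can be made concrete with a small diagnostic. The sketch below assumes something the text does not provide: that the excitation caused by each source alone is available as a separate row of an array (for example, by presenting sources in isolation). The function name `selection_diagnostics` and the `threshold` used to decide what counts as "meaningfully" excited are hypothetical.

```python
import numpy as np

def selection_diagnostics(S_by_source, threshold=1.0):
    """Rough sparsity and multi-source diagnostics (illustrative only).

    S_by_source : (n_sources, L) array; row j is the excitation vector that
                  source j would produce on its own (an assumption).
    threshold   : excitation level treated as "meaningfully above zero"
                  (hypothetical value).
    Returns (sparsity_per_source, multi_source_rate).
    """
    active = S_by_source > threshold              # which features each source excites
    sources_per_feature = active.sum(axis=0)      # how many sources drive each feature

    # Fraction of the L features meaningfully excited by each single source.
    sparsity_per_source = active.mean(axis=1)

    # Among responding features, the fraction driven by more than one source.
    n_responding = int((sources_per_feature > 0).sum())
    if n_responding == 0:
        multi_source_rate = 0.0
    else:
        multi_source_rate = float((sources_per_feature > 1).sum()) / n_responding
    return sparsity_per_source, multi_source_rate
```

A feature bank that meets the stated criteria would show small values for both quantities on a typical mix of sources.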
Figure 3.6 illustrates a typical primary sound symbol excitation vector S. This is the mechanism
of analog sound input transduction into the world of symbols. A new S vector is created 100 times
per second. S describes the content of the sound scene being monitored by the microphone at that
moment. Each of the L components of S (again, L is typically tens of thousands) represents the
response of one sound feature detector (as described above) to this current sonic scene.
S is composed of small, mostly disjoint (but usually not contiguous), subsets of excited sound
symbol components — one subset for each sound source in the current auditory scene. Again, each
excited symbol is typically responding to the sound emanating from only one of the sound sources
in the audio scene being monitored by the microphone. While this single-source-per-excited-
symbol rule is not strictly true all the time, it is almost always true (which, as we will see, is all
that matters). Thus, if at each moment, we could somehow decide which subset of excited symbols
of the symbol excitation vector to pay attention to, we could ignore the other symbols and thereby
focus our attention on one source. That is the essence of all initial cortical sensory processing
(auditory, visual, gustatory, olfactory, and somatosensory): figuring out, in real time, which
primary sensor input representation symbols to pay attention to, and ignoring the rest. This
ubiquitous cognitive process is termed attended object segmentation.
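The effect of attending to one subset of excited symbols can be illustrated as a simple masking operation on S. The sketch below shows only the "ignore the rest" step; how the attended subset is chosen is not described in this passage, so the index list `attended_idx` is taken as given, and the function name `attend` is hypothetical.

```python
import numpy as np

def attend(S, attended_idx):
    """Keep only the attended symbols of S and zero the others (illustrative only).

    S            : (L,) primary sound symbol excitation vector
    attended_idx : indices of the symbols to pay attention to
                   (assumed to be supplied by some selection process)
    """
    mask = np.zeros(S.shape, dtype=bool)
    mask[attended_idx] = True
    # Symbols outside the attended subset are ignored by setting them to zero.
    return np.where(mask, S, 0.0)
```

For example, if symbols 1 and 4 belong to the source of interest, `attend(S, [1, 4])` returns a vector in which only those two components keep their excitation.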
Figure 3.6 Illustration of the properties of a primary sound symbol excitation vector S (only a few of the L
components of S are shown). Excited symbols have thicker circles. Each of the four sound sources present (at the
moment illustrated) in the auditory scene being monitored is causing a relatively small subset of feature symbols to
be excited. Note that the symbols excited by sources 1 and 3 are not contiguous. That is typical. Keep in mind that
the number of symbols, L (which is equal to the number of feature vectors), is typically tens of thousands, of which
only a small fraction are meaningfully excited. This is because each sound source only excites a relatively small
number of sound features at each moment and typical audio scenes contain only a relatively small number of sound
sources (typically fewer than 20 monaurally distinguishable sources).