                    Figure 3.6). Mathematically, the symbols of each primary sound lexicon are a vector quantizer
                    (Zador, 1963) for the set of S vectors that arise, from all sound sources that are likely to occur,
                    when each source is presented in isolation (i.e., no mixtures). Among the symbol sets that are
                    responding to S are some that represent the sounds coming from the attended speaker. This
                    illustrates the critically important need to design the acoustic front-end so as to achieve this sort
                    of quasiorthogonalization of sources. By confining each sound feature to a properly selected time
                    interval (a subinterval of the 8000 samples available at each moment, ending at the most
                    recent 16 kHz sample), and by using the proper postfiltering (after the dot product with the feature
                    vector has been computed), this quasiorthogonalization can be accomplished. (Note: This scheme
                    answers the question of how brains carry out "independent component analysis" [Hyvärinen et al., 2001]. They don't need to. Properly designed quasiorthogonalizing features, adapted to the pure sound sources that the critter encounters in the real world, map each source of an arbitrary mixture of sources into its own separate components of the S vector. In effect, this is a sort of "one-time ICA" feature development process carried out during development and then essentially
                    frozen (or perhaps adaptively maintained). Given the stream of S vectors, the confabulation
                    processing which follows (as described below) can then, at each moment, ignore all but the attended
                    source-related subset of components, independent of how many, or few, interfering sources
                    are present. Of course, this is exactly what is observed in mammalian audition — effortless
                    segmentation of the attended source at the very first stage of auditory (or visual or somatosensory,
                    etc.) perception.)
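                    A minimal sketch may make the front-end idea above concrete. It assumes the 0.5 s buffer of 16 kHz samples described in the text; the names s_vector, feature_bank, and nearest_symbol, the half-wave-rectifying postfilter, and the nearest-prototype codebook are illustrative choices, not the chapter's actual design.

import numpy as np

# Toy front end: each feature is confined to its own subinterval of the most
# recent 8000-sample (0.5 s at 16 kHz) buffer, applied as a dot product, and
# followed by a simple postfilter; the primary sound lexicon's symbols then
# act as a vector quantizer for the resulting S vectors.

def s_vector(samples, feature_bank):
    """samples: 1-D array holding the latest 8000 audio samples.
    feature_bank: list of (start, weights) pairs; each weight vector is
    applied only to samples[start : start + len(weights)]."""
    responses = []
    for start, weights in feature_bank:
        window = samples[start:start + len(weights)]  # confine to a subinterval
        raw = float(np.dot(window, weights))          # dot product with the feature vector
        responses.append(max(raw, 0.0))               # toy postfilter (half-wave rectification)
    return np.array(responses)

def nearest_symbol(s, codebook):
    """codebook: array of stored prototype S vectors, one row per symbol.
    Returns the index of the closest prototype (vector quantization)."""
    return int(np.argmin(np.linalg.norm(codebook - s, axis=1)))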
                      The expectation formed on the next-word acoustic lexicon of Figure 3.7 (which is a huge
                    structure, almost surely implemented in the human brain by a number of physically separate
                    lexicons) is created by successive C1Fs. The first is based on input from the speaker model
                    lexicon. The only symbols (each representing a stored acoustic model for a single word — see
                    below) that then remain available for further use are those connected with the speaker currently
                    being attended to.
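                    A minimal sketch of this first winnowing operation is given below, with an expectation modeled simply as the set of symbols still available on a lexicon and a knowledge base as a dictionary from each source symbol to the target symbols it links to; the names transmit and first_c1f and the unweighted set representation are simplifying assumptions, not the chapter's actual knowledge-link machinery.

# Expectations are modeled as plain sets of symbols; a knowledge base is a
# dict mapping each source-lexicon symbol to the set of target-lexicon
# symbols it links to.

def transmit(expectation, knowledge_base):
    """Transmit an expectation through a knowledge base: the target-lexicon
    expectation contains every symbol linked from any symbol in the source
    expectation."""
    targets = set()
    for symbol in expectation:
        targets |= knowledge_base.get(symbol, set())
    return targets

def first_c1f(attended_speaker, speaker_to_word_kb):
    """First C1F: starting from the attended speaker's symbol on the speaker
    model lexicon, keep only that speaker's stored word models on the
    next-word acoustic lexicon."""
    return transmit({attended_speaker}, speaker_to_word_kb)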
                      The second C1F is executed in connection with input from the language module word lexicon
                    that has an expectation on it representing possible predictions of the next word that the speaker will
                    produce (this next-word lexicon expectation is produced using essentially the same process as was
                    described in Section 3.3 in connection with sentence continuation with context). (Note: This is an
                    example of the situation mentioned above and in the Appendix, where an expectation is allowed to
                    transmit through a knowledge base.) After this operation, the only symbols left available for use on
                    the next-word acoustic lexicon are those representing expected words spoken by the attended
                    speaker. This expectation is then used for the processing involved in recognizing the attended
                    speaker’s next word.
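                    In the same toy terms, the second C1F can be sketched as transmitting the language module's next-word expectation through its own knowledge base and retaining only the symbols that also survived the first C1F; treating the combination of the two C1Fs as a set intersection is a deliberate simplification.

def second_c1f(after_first_c1f, predicted_next_words, word_to_acoustic_kb):
    """Second C1F: keep only the next-word acoustic symbols that are both
    word models of the attended speaker (survivors of the first C1F) and
    words the language module predicts will come next."""
    from_language = transmit(predicted_next_words, word_to_acoustic_kb)
    return after_first_c1f & from_language

# Example use (all names hypothetical):
# expectation = second_c1f(first_c1f("speaker_7", speaker_kb),
#                          {"rain", "reign"}, word_kb)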
                    As shown in Figure 3.7, knowledge bases have previously been established (using pure-source, or well-segmented-source, examples) linking the primary sound symbol lexicons with the sound phrase lexicons in both directions, and likewise linking these with the next-word acoustic lexicon. Using these knowledge bases, the expectation on the next-word acoustic lexicon is transferred (as described immediately above) to the sound phrase lexicons, where
                    expectations are formed; and from these to the primary sound lexicons, where additional expectations are formed. It is easy to imagine that, since each of these transferred expectations is typically much larger than the one from which it came, by the time this process gets to the primary sound lexicons the expectations will encompass almost every symbol. THIS IS NOT SO! While these
                    primary lexicon expectations are indeed large (they may encompass many hundreds of symbols),
                    they are still only a small fraction of the total set of tens of thousands of symbols. Given these
                    transfers, which actually occur as soon as the recognition of the previous word is completed (often long before its acoustic content ceases arriving), the architecture is prepared for detecting the next word spoken by the attended speaker.
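                    Continuing the same toy model, this downward transfer amounts to two further applications of transmit. In the text's setting each transferred expectation is larger than its source yet, as stressed above, still covers only hundreds of a primary lexicon's tens of thousands of symbols.

def cascade(next_word_expectation, acoustic_to_phrase_kb, phrase_to_primary_kb):
    """Transfer the next-word acoustic expectation to the sound phrase
    lexicons, and from there to the primary sound lexicons, by two further
    transmissions through the corresponding knowledge bases."""
    phrase_expectation = transmit(next_word_expectation, acoustic_to_phrase_kb)
    primary_expectation = transmit(phrase_expectation, phrase_to_primary_kb)
    return phrase_expectation, primary_expectation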