
lexicons then create expectations in response to C1Fs. The secondary visual layer expectation symbols are then transmitted to other secondary lexicons without expectations (if any there be) and to tertiary lexicons, again using the knowledge links established during training, and C1Fs establish expectations on all relevant lexicons. Finally, the knowledge links of the third layer are used to transmit from the tertiary expectations to any lexicons without expectations, followed by a final round of C1Fs.
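To make the feedforward step concrete, the following toy sketch (in Python, with invented lexicon names, symbols, and data structures; it illustrates the idea and is not the implementation described in this chapter) treats each lexicon's expectation as a set of symbols and each knowledge link as a map from source symbols to the target symbols they were associated with during training:

# Hypothetical toy rendering of the feedforward expectation step
# (all names and data below are invented for illustration only).
expectations = {
    "primary_A": {"edge_17", "edge_42"},   # set by the initial C1Fs
    "secondary_B": set(),                  # no expectation yet
}
links = [
    # (source lexicon, target lexicon, compatibility map learned in training)
    ("primary_A", "secondary_B", {"edge_17": {"corner_3"},
                                  "edge_42": {"corner_3", "corner_9"}}),
]

def feedforward_pass(expectations, links):
    # Every lexicon that already has an expectation transmits, through its
    # knowledge links, the union of compatible symbols to linked lexicons
    # that do not yet have an expectation.
    for src, tgt, compat in links:
        if expectations[src] and not expectations[tgt]:
            received = set()
            for symbol in expectations[src]:
                received |= compat.get(symbol, set())
            expectations[tgt] = received   # a C1F on tgt would then refine this

feedforward_pass(expectations, links)
print(expectations["secondary_B"])         # {'corner_3', 'corner_9'}

In the architecture described here, each transmission is followed by a C1F on the receiving lexicon; in this sketch that refinement is only noted in a comment.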
                       The expectations formed by this initial ‘‘feedforward’’ interaction represent all of the symbols
                    that are known (i.e., established by the knowledge) to be compatible with the combinations of the
                    symbols in the primary lexicon expectations. At this point, a consensus building process is launched
                    involving all nonnulled lexicons on all layers and all knowledge bases linking those lexicons. This
                    consensus building process hones all the expectations until each of the involved lexicons has at
                    most one symbol left (which is, of necessity, active). This collection of symbols is the vision
                    module’s representation of the attended visual object.
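The consensus-building step can be caricatured in the same toy terms. The sketch below (again with invented lexicons and data; the actual process operates on graded knowledge-link strengths rather than bare set membership) repeatedly discards, from the target lexicon of each link, every symbol that no surviving symbol in the linked lexicon supports, stopping when nothing changes:

# Hypothetical toy rendering of consensus building (invented data; the
# real process uses graded link excitation, not bare set intersection).
expectations = {
    "shape": {"truck", "bus", "van"},
    "color": {"red"},
    "size":  {"large", "medium"},
}
links = [
    # (lexicon_a, lexicon_b, map from an a-symbol to compatible b-symbols)
    ("color", "shape", {"red": {"truck", "bus"}}),
    ("shape", "size",  {"truck": {"large"}, "bus": {"large"},
                        "van": {"medium"}}),
    ("size",  "shape", {"large": {"truck"}, "medium": {"van"}}),
]

def hone(expectations, links, max_rounds=20):
    # For each link (a, b), keep in expectations[b] only the symbols that
    # at least one surviving symbol in expectations[a] supports; repeat
    # until no expectation changes. Nulled lexicons are not involved.
    for _ in range(max_rounds):
        changed = False
        for a, b, compat in links:
            if not expectations[a] or not expectations[b]:
                continue
            supported = {s for s in expectations[b]
                         if any(s in compat.get(q, set())
                                for q in expectations[a])}
            if supported and supported != expectations[b]:
                expectations[b] = supported
                changed = True
        if not changed:
            break

hone(expectations, links)
print(expectations)   # shape -> {'truck'}, color -> {'red'}, size -> {'large'}

With the toy data shown, the process settles on a single symbol in every lexicon, mirroring the ‘‘at most one symbol left’’ end state described above.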
                       This tertiary visual object representation has three important properties. First, it has significant
                    pose insensitivity. With high probability, if you changed the pose of the object somewhat, almost
                    the same set of symbols would be obtained as the object’s representation.
Second, the object has been completed, meaning that the representation has removed the effects of occluding objects that blocked the view of some portions of the object (of course, the visible portions of the object must be sufficient for completion by this method).
Third, the representation of the object at the lower levels contains details. For example, if the object is a truck being viewed from the front, the front grille and headlamps will typically be visible and will be represented at the primary level, whereas the representation of the object at the tertiary level will not have these details. It will be more abstract (many more specific truck images would invoke this same, or a very similar, representation).

                    3.5.5 Linking the Visual Module with the Language Module

                    Once the visual module is built, what good is it? By itself, not much. It only becomes useful when it
                    is linked by knowledge with other cognitive modules. This subsection presents a brief sketch of an
                    example of how, via instruction by a human educator, a vision module could be usefully linked with
                    a language module.
A problem that has been widely considered is the automated text annotation of video: describing objects within video scenes and some of those objects' attributes. For example, such annotations
                    might be useful for blind people if the images being annotated were taken by a camera mounted on a
                    pair of glasses (and the annotations were synthesized into speech provided by the glasses to the
                    wearer’s ears via small tubes issuing from the temples of the glasses near the ears).
                       Figure 3.12 illustrates a simple concept for such a text annotation system. Video input from
                    the eyeglasses-mounted camera is operated upon by the gaze controller and objects that it
                    selects are segmented and represented by the already-developed visual module, as described
                    in the previous subsection. The objects that were used in the visual module development
                    process were those that a blind person would want to be informed of (curbs, roads, cars, people,
                    etc.). Thus, by virtue of its development, the visual module will search each new frame of video
                    for an object of operational interest (because these were the objects sought out by the
human educator whose examples were used to train the gaze controller perceptron) and then that
                    object will be segmented, and after consensus building, represented by the module on all of its three
                    layers.
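A skeleton of that processing loop might look as follows; every component and method name is a hypothetical stand-in for the modules of Figure 3.12, not an existing API:

# Hypothetical skeleton of the Figure 3.12 annotation pipeline; the
# gaze_controller, visual_module, and text_module objects are invented
# stand-ins for the modules described in the text.
def annotate_stream(frames, gaze_controller, visual_module, text_module):
    for frame in frames:
        fixation = gaze_controller.select_fixation(frame)   # object of operational interest?
        if fixation is None:
            continue                                         # nothing worth reporting
        segment = visual_module.segment(frame, fixation)     # isolate the attended object
        representation = visual_module.represent(segment)    # consensus-built, three layers
        yield text_module.describe(representation)           # via learned knowledge links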
                       To build the knowledge links from the visual module to the text module, another human
                    educator is used. This educator looks at each fixation point object selected by the vision module
                    (while it is being used out on the street in an operationally realistic manner), and if this is indeed an
                    object that would be of interest to a blind person, types in one to five sentences describing that