
lexicons then create expectations in response to C1Fs. The secondary visual layer expectation symbols are then transmitted to other secondary lexicons without expectations (if any there be) and to tertiary lexicons, again using the knowledge links established during training, and C1Fs establish expectations on all relevant lexicons. Finally, the knowledge links of the third layer are used to transmit from the tertiary expectations to any lexicons without expectations, followed by a final round of C1Fs.
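To make the feedforward step concrete, the following toy sketch (in Python, with invented lexicon names, symbols, and data structures; it illustrates the idea and is not the implementation described in this chapter) treats each lexicon's expectation as a set of symbols and each knowledge link as a map from source symbols to the target symbols they were associated with during training:

# Hypothetical toy rendering of the feedforward expectation step
# (all names and data below are invented for illustration only).
expectations = {
    "primary_A": {"edge_17", "edge_42"},   # set by the initial C1Fs
    "secondary_B": set(),                  # no expectation yet
}
links = [
    # (source lexicon, target lexicon, compatibility map learned in training)
    ("primary_A", "secondary_B", {"edge_17": {"corner_3"},
                                  "edge_42": {"corner_3", "corner_9"}}),
]

def feedforward_pass(expectations, links):
    # Every lexicon that already has an expectation transmits, through its
    # knowledge links, the union of compatible symbols to linked lexicons
    # that do not yet have an expectation.
    for src, tgt, compat in links:
        if expectations[src] and not expectations[tgt]:
            received = set()
            for symbol in expectations[src]:
                received |= compat.get(symbol, set())
            expectations[tgt] = received   # a C1F on tgt would then refine this

feedforward_pass(expectations, links)
print(expectations["secondary_B"])         # {'corner_3', 'corner_9'}

In the architecture described here, each transmission is followed by a C1F on the receiving lexicon; in this sketch that refinement is only noted in a comment.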
                       The expectations formed by this initial ‘‘feedforward’’ interaction represent all of the symbols
                    that are known (i.e., established by the knowledge) to be compatible with the combinations of the
                    symbols in the primary lexicon expectations. At this point, a consensus building process is launched
                    involving all nonnulled lexicons on all layers and all knowledge bases linking those lexicons. This
                    consensus building process hones all the expectations until each of the involved lexicons has at
                    most one symbol left (which is, of necessity, active). This collection of symbols is the vision
                    module’s representation of the attended visual object.
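The consensus-building step can be caricatured in the same toy terms. The sketch below (again with invented lexicons and data; the actual process operates on graded knowledge-link strengths rather than bare set membership) repeatedly discards, from the target lexicon of each link, every symbol that no surviving symbol in the linked lexicon supports, stopping when nothing changes:

# Hypothetical toy rendering of consensus building (invented data; the
# real process uses graded link excitation, not bare set intersection).
expectations = {
    "shape": {"truck", "bus", "van"},
    "color": {"red"},
    "size":  {"large", "medium"},
}
links = [
    # (lexicon_a, lexicon_b, map from an a-symbol to compatible b-symbols)
    ("color", "shape", {"red": {"truck", "bus"}}),
    ("shape", "size",  {"truck": {"large"}, "bus": {"large"},
                        "van": {"medium"}}),
    ("size",  "shape", {"large": {"truck"}, "medium": {"van"}}),
]

def hone(expectations, links, max_rounds=20):
    # For each link (a, b), keep in expectations[b] only the symbols that
    # at least one surviving symbol in expectations[a] supports; repeat
    # until no expectation changes. Nulled lexicons are not involved.
    for _ in range(max_rounds):
        changed = False
        for a, b, compat in links:
            if not expectations[a] or not expectations[b]:
                continue
            supported = {s for s in expectations[b]
                         if any(s in compat.get(q, set())
                                for q in expectations[a])}
            if supported and supported != expectations[b]:
                expectations[b] = supported
                changed = True
        if not changed:
            break

hone(expectations, links)
print(expectations)   # shape -> {'truck'}, color -> {'red'}, size -> {'large'}

With the toy data shown, the process settles on a single symbol in every lexicon, mirroring the ‘‘at most one symbol left’’ end state described above.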
                       This tertiary visual object representation has three important properties. First, it has significant
                    pose insensitivity. With high probability, if you changed the pose of the object somewhat, almost
                    the same set of symbols would be obtained as the object’s representation.
Second, the object has been completed, meaning that the representation has removed the effects of occluding objects that blocked the view of some portions of the object (of course, the visible portions of the object must be sufficient for completion by this method).
Third, the representation of the object at the lower levels contains details. For example, if the object is a truck being viewed from the front, the front grille and headlamps will typically be visible and will be represented at the primary level, whereas the representation of the object at the tertiary level will not have these details. It will be more abstract (many more specific truck images would invoke this same, or a very similar, representation).

                    3.5.5 Linking the Visual Module with the Language Module

                    Once the visual module is built, what good is it? By itself, not much. It only becomes useful when it
                    is linked by knowledge with other cognitive modules. This subsection presents a brief sketch of an
                    example of how, via instruction by a human educator, a vision module could be usefully linked with
                    a language module.
A problem that has been widely considered is the automated text annotation of video: describing objects within video scenes and some of those objects' attributes. For example, such annotations
                    might be useful for blind people if the images being annotated were taken by a camera mounted on a
                    pair of glasses (and the annotations were synthesized into speech provided by the glasses to the
                    wearer’s ears via small tubes issuing from the temples of the glasses near the ears).
                       Figure 3.12 illustrates a simple concept for such a text annotation system. Video input from
                    the eyeglasses-mounted camera is operated upon by the gaze controller and objects that it
                    selects are segmented and represented by the already-developed visual module, as described
                    in the previous subsection. The objects that were used in the visual module development
                    process were those that a blind person would want to be informed of (curbs, roads, cars, people,
                    etc.). Thus, by virtue of its development, the visual module will search each new frame of video
                    for an object of operational interest (because these were the objects sought out by the
human educator whose examples were used to train the gaze controller perceptron) and then that
                    object will be segmented, and after consensus building, represented by the module on all of its three
                    layers.
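A skeleton of that processing loop might look as follows; every component and method name is a hypothetical stand-in for the modules of Figure 3.12, not an existing API:

# Hypothetical skeleton of the Figure 3.12 annotation pipeline; the
# gaze_controller, visual_module, and text_module objects are invented
# stand-ins for the modules described in the text.
def annotate_stream(frames, gaze_controller, visual_module, text_module):
    for frame in frames:
        fixation = gaze_controller.select_fixation(frame)   # object of operational interest?
        if fixation is None:
            continue                                         # nothing worth reporting
        segment = visual_module.segment(frame, fixation)     # isolate the attended object
        representation = visual_module.represent(segment)    # consensus-built, three layers
        yield text_module.describe(representation)           # via learned knowledge links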
                       To build the knowledge links from the visual module to the text module, another human
                    educator is used. This educator looks at each fixation point object selected by the vision module
                    (while it is being used out on the street in an operationally realistic manner), and if this is indeed an
                    object that would be of interest to a blind person, types in one to five sentences describing that