which represent words that occur later in the temporal sequence of the word string. The first (i.e.,
leftmost) lexicon is connected to each of the 19 lexicons that follow it by 19 individual knowledge
bases. The second lexicon is connected to the 18 lexicons to its right, and so forth. Thus, this
architecture has a total of 20 lexicons and 190 knowledge bases.
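As a concrete illustration, a minimal Python sketch (with hypothetical names, not taken from the chapter) enumerates one knowledge base for each ordered pair of lexicons:

```python
# Hypothetical sketch: one knowledge base per ordered (source, target)
# lexicon pair, with the source always earlier than the target.
NUM_LEXICONS = 20

knowledge_base_pairs = [
    (src, tgt)
    for src in range(NUM_LEXICONS)
    for tgt in range(src + 1, NUM_LEXICONS)
]

# 19 + 18 + ... + 1 = 20 * 19 / 2 = 190 knowledge bases in total.
assert len(knowledge_base_pairs) == 190
```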
The training process starts with the first sentence of the training corpus and marches one
sentence at a time to the last sentence. As each sentence is encountered, it is entered into the
architecture of Figure 3.1 (unless its first 20 words include a word not among the 63,000, in which
case, for this introduction, the sentence is simply skipped) and used for training. The details
of training are now discussed.
At the beginning of training, one hundred and ninety 63,000 × 63,000 single-precision floating-point
matrices are created (one for each knowledge base) and all of their entries are set to zero. In each
knowledge base’s matrix, each row corresponds to a unique source lexicon symbol and each
column corresponds to a unique target lexicon symbol. The indices of the symbols of each lexicon
are arbitrary, but once set, they are frozen forever. These matrices are used initially, during training
on the text corpus, to store the (integer) co-occurrence counts for the (causally) ordered symbol
pairs of each knowledge base. Then, once these counts are accumulated, the matrices are used to
calculate and store the (floating-point) p(c|l) antecedent support probabilities. In practice, various
computer science storage schemes for sparse matrices are used (in both RAM and on hard disk) to
keep the total memory cost low.
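One possible realization of such a sparse scheme, sketched here with hypothetical names (the chapter does not specify its data structures), stores only the nonzero counts in a dictionary rather than allocating dense 63,000 × 63,000 matrices:

```python
from collections import defaultdict

# Hypothetical sketch of sparse count storage: one count table per
# knowledge base, holding only the (source symbol, target symbol) pairs
# actually observed. Memory then grows with the data, not with the
# 63,000 x 63,000 dense matrix size.
def make_count_tables(num_lexicons=20):
    return {
        (src, tgt): defaultdict(int)  # (c, l) -> co-occurrence count
        for src in range(num_lexicons)
        for tgt in range(src + 1, num_lexicons)
    }
```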
Given a training sentence, it is entered into the lexicons of the architecture by activating the
symbol representing each word or punctuation mark of the sentence, in order. Unused trailing lexicons
are left blank (null). Then, each causal symbol pair is recorded in the matrix of the corresponding
knowledge base by incrementing the numeric entry for that particular source symbol (the index of
which determines the row of the entry) and target symbol (the index of which determines the
column of the entry) pair by one.
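A sketch of this counting step, assuming the sparse count tables above (all names are hypothetical):

```python
def train_on_sentence(counts, symbol_ids, num_lexicons=20):
    """Record every causal (earlier word, later word) symbol pair of one
    sentence, incrementing the matching knowledge base count by one.

    `symbol_ids` lists the lexicon symbol indices of the sentence's
    words and punctuation, in order; trailing lexicons past the end of
    the sentence are left null, so no counts are recorded for them.
    """
    n = min(len(symbol_ids), num_lexicons)
    for src in range(n):
        for tgt in range(src + 1, n):
            pair = (symbol_ids[src], symbol_ids[tgt])
            counts[(src, tgt)][pair] += 1
```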
After all of the many tens of millions of sentences of the training corpus have been used
("read") for training (i.e., the entire training corpus has been traversed from the first sentence to the
last), the entries (ordered symbol pair co-occurrence counts) in each knowledge base's matrix are
then used to create the knowledge links of that knowledge base.
Given a knowledge base matrix, what we have traditionally done is to first set to zero any counts
which fall below some fixed threshold (e.g., 3 in some experiments, 25 or even 50 in others).
In effect, such low counts are thereby deemed random and not meaningful. Then, after these low-
frequency co-occurrences have been set to zero, we use the ‘‘column sum’’ of each count matrix to
determine the appearance count c(l) of each target symbol l for a particular knowledge base.
Specifically, if the count of co-occurrences of source symbol c with target symbol l is c(c,l) (i.e.,
the matrix entry in row c and column l), then we set c(l) equal to the column sum of the quantities
c(f,l) over all source lexicon symbols f. Finally, the knowledge link probability p(c|l) is set equal
to c(c,l)/c(l), which approximates the ratio p(c,l)/p(l), which by Bayes' law is equal to p(c|l).
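Put together, one possible sketch of this thresholding and normalization for a single knowledge base (hypothetical names; the threshold of 3 is one of the example values mentioned above):

```python
from collections import defaultdict

def build_knowledge_links(count_table, count_threshold=3):
    """Turn one knowledge base's co-occurrence counts into p(c|l) links.

    Counts below `count_threshold` are treated as random and zeroed;
    each surviving count c(c,l) is then divided by the column sum c(l),
    giving p(c|l) = c(c,l) / c(l).
    """
    # Discard low-frequency co-occurrences as noise.
    kept = {k: v for k, v in count_table.items() if v >= count_threshold}

    # Column sums: the appearance count c(l) of each target symbol l.
    target_totals = defaultdict(int)
    for (c, l), n in kept.items():
        target_totals[l] += n

    # Knowledge link probabilities p(c|l) = c(c,l) / c(l).
    return {(c, l): n / target_totals[l] for (c, l), n in kept.items()}
```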
Note that the values of c(c,l), c(l), and p(c|l) for the same two symbols can differ significantly
for different pairs of source and target lexicons within the sentence. This is because the appearances
of particular words at various positions within a sentence differ greatly. For example, essentially no
sentences begin with the uncapitalized word "and". Thus, the value of c(c,l) will be zero for every
knowledge base matrix with the first lexicon as its source region and the symbol c = "and" as the
source symbol. However, for many other pairs of lexicons and target symbols, this value will be
large. (A technical point: these disparities are greatest at the early words of a sentence. At later
positions in a sentence, the p(c|l) values tend to be very much the same for the same displacement
between the lexicons — probably the underlying reason why language can be handled well by a ring
architecture.)
After the p(c|l) knowledge link probabilities have been created for all 190 knowledge bases
using the above procedure, we have then traditionally set any of these quantities which are below
some small value (e.g., in some experiments 0.0001, in others 0.0002, or even 0.0005) to zero; on