
which represent words which occur later in the temporal sequence of the word string. The first (i.e., leftmost) lexicon is connected to all of the 19 lexicons which follow it by 19 individual knowledge bases; the second lexicon is connected to the 18 lexicons to its right by 18 knowledge bases; and so forth. Thus, this architecture has a total of 20 lexicons and 190 knowledge bases.
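As an illustrative aside (not part of the chapter's own description), the short Python sketch below simply enumerates the ordered source-to-target lexicon pairs of such an architecture and confirms the count of 190 knowledge bases; the variable names are assumptions made only for this sketch.

# Sketch: one knowledge base per ordered (source, target) lexicon pair,
# where the source lexicon precedes the target lexicon in the word string.
NUM_LEXICONS = 20
knowledge_base_pairs = [(src, tgt)
                        for src in range(NUM_LEXICONS)
                        for tgt in range(src + 1, NUM_LEXICONS)]
assert len(knowledge_base_pairs) == 190   # 19 + 18 + ... + 1 = 20*19/2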
The training process starts with the first sentence of the training corpus and marches one sentence at a time to the last sentence. As each sentence is encountered, it is entered into the architecture of Figure 3.1 (unless its first 20 words include a word not among the 63,000, in which case, for this introduction, the sentence is assumed to be skipped) and used for training. The details of training are now discussed.
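As a further illustration, and under the same caveat that the names used here are hypothetical, the corpus traversal just described might be sketched in Python as follows.

# Sketch: one pass over the corpus, sentence by sentence, skipping any sentence
# whose first 20 tokens include a word outside the 63,000-word vocabulary.
# 'architecture.train_on_sentence' is a hypothetical per-sentence update routine.
def train_on_corpus(corpus_sentences, vocabulary, architecture):
    for sentence in corpus_sentences:            # first sentence to last, in order
        tokens = sentence[:20]                   # at most 20 words/punctuation marks
        if any(tok not in vocabulary for tok in tokens):
            continue                             # sentence skipped, as assumed above
        architecture.train_on_sentence(tokens)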
At the beginning of training, one hundred and ninety 63,000 × 63,000 single-precision float matrices are created (one for each knowledge base) and all of their entries are set to zero. In each knowledge base's matrix, each row corresponds to a unique source lexicon symbol and each column corresponds to a unique target lexicon symbol. The indices of the symbols of each lexicon are arbitrary, but once set, they are frozen forever. These matrices are used initially, during training on the text corpus, to store the (integer) co-occurrence counts for the (causally) ordered symbol pairs of each knowledge base. Then, once these counts are accumulated, the matrices are used to calculate and store the (floating point) p(c|l) antecedent support probabilities. In practice, various computer science storage schemes for sparse matrices are used (in both RAM and on hard disk) to keep the total memory cost low.
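One minimal sketch of such sparse storage, assuming Python dictionaries that record only the nonzero entries, keyed by (source symbol index, target symbol index), might look as follows; this is merely one of many possible schemes, not the implementation actually used.

# Sketch: one sparse count "matrix" per knowledge base; absent keys mean zero,
# so the 63,000 x 63,000 matrices cost memory only for observed symbol pairs.
from collections import defaultdict

NUM_LEXICONS = 20
knowledge_bases = {(src, tgt): defaultdict(int)
                   for src in range(NUM_LEXICONS)
                   for tgt in range(src + 1, NUM_LEXICONS)}   # 190 all-zero matrices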
Each training sentence is entered into the lexicons of the architecture by activating the symbol representing each word or punctuation mark of the sentence, in order. Unused trailing lexicons are left blank (null). Then, each causally ordered symbol pair is recorded in the matrix of the corresponding knowledge base by incrementing the numeric entry for that particular source symbol (the index of which determines the row of the entry) and target symbol (the index of which determines the column of the entry) pair by one.
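This per-sentence counting step might be sketched as follows, again assuming the dictionary representation above and a hypothetical symbol_index table mapping each word or punctuation mark to its frozen integer index.

# Sketch: record every causally ordered symbol pair of one (at most 20-token) sentence.
def count_sentence(tokens, symbol_index, knowledge_bases):
    indices = [symbol_index[tok] for tok in tokens]   # lexicons beyond len(tokens) stay null
    for src_pos in range(len(indices)):
        for tgt_pos in range(src_pos + 1, len(indices)):
            pair = (indices[src_pos], indices[tgt_pos])
            knowledge_bases[(src_pos, tgt_pos)][pair] += 1   # increment co-occurrence count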
After all of the many tens of millions of sentences of the training corpus have been used ("read") for training (i.e., the entire training corpus has been traversed from the first sentence to the last), the entries (ordered symbol pair co-occurrence counts) in each knowledge base's matrix are then used to create the knowledge links of that knowledge base.
Given a knowledge base matrix, what we have traditionally done is to first set to zero any counts which are below some fixed threshold (e.g., in some experiments three, and in others 25 or even 50). In effect, such low counts are thereby deemed random and not meaningful. Then, after these low-frequency co-occurrences have been set to zero, we use the "column sum" of each count matrix to determine the appearance count c(l) of each target symbol l for a particular knowledge base. Specifically, if the count of co-occurrences of source symbol c with target symbol l is c(c,l) (i.e., the matrix entry in row c and column l), then we set c(l) equal to the column sum of the quantities c(f,l) over all source lexicon symbols f. Finally, the knowledge link probability p(c|l) is set equal to c(c,l)/c(l), which approximates the ratio p(c,l)/p(l), which by Bayes' law is equal to p(c|l).
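A hedged Python sketch of this conversion from counts to knowledge link probabilities, using the dictionary representation assumed above and an illustrative threshold of three, might read as follows.

# Sketch: build p(c|l) = c(c,l) / c(l) for one knowledge base, after zeroing rare pairs.
from collections import defaultdict

def make_knowledge_links(counts, count_threshold=3):
    # counts: sparse matrix {(c, l): c(c,l)} of co-occurrence counts.
    kept = {pair: n for pair, n in counts.items() if n >= count_threshold}
    column_sums = defaultdict(int)                    # c(l) = sum over f of c(f,l)
    for (src_sym, tgt_sym), n in kept.items():
        column_sums[tgt_sym] += n
    return {(src_sym, tgt_sym): n / column_sums[tgt_sym]    # p(c|l)
            for (src_sym, tgt_sym), n in kept.items()}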
Note that the values of c(c,l), c(l) and p(c|l) for the same two symbols can differ significantly
                    for different pairs of source and target lexicons within the sentence. This is because the appearances
                    of particular words at various positions within a sentence differ greatly. For example, essentially no
                    sentences begin with the uncapitalized word and. Thus, the value of c(c,l) will be zero for every
knowledge base matrix with the first lexicon as its source region and the symbol c = and as the
                    source symbol. However, for many other pairs of lexicons and target symbols, this value will be
                    large. (A technical point: these disparities are greatest at the early words of a sentence. At later
positions in a sentence, the p(c|l) values tend to be very much the same for the same displacement
                    between the lexicons — probably the underlying reason why language can be handled well by a ring
                    architecture.)
After the p(c|l) knowledge link probabilities have been created for all 190 knowledge bases using the above procedure, we have then traditionally set any of these quantities which are below some small value (e.g., in some experiments 0.0001, in others 0.0002, or even 0.0005) to zero; on