network design. The CoDeepNEAT method incorporates both approaches: neuro-
evolution searches for both new LSTM units and their connectivity across multiple
layers at the same time.
CoDeepNEAT was slightly modified to make it easier to find novel connectivities
between LSTM layers. Multiple LSTM layers are flattened into a neural network
graph that is then modified by neuroevolution. There are two types of mutations:
one enables or disables a connection between LSTM layers, and the other adds or
removes skip connections between two LSTM nodes. Recently, skip connections
have led to performance improvements in deep neural networks, which suggests
that they could be useful for LSTM networks as well. Thus, neuroevolution modifies
both the high-level network topology and the low-level LSTM connections.
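To make these two mutation types concrete, the following Python sketch shows one possible way such operators could act on a flattened network graph. The data structure and names (blueprint, layer_edges, skip_edges, mutate) are illustrative assumptions of this sketch, not CoDeepNEAT's actual implementation.

import random

def mutate(blueprint, p_layer_conn=0.5):
    # Apply one of the two LSTM-specific mutations to a network blueprint.
    # The blueprint is assumed (for this sketch) to be a dict with:
    #   'layers'      : list of LSTM layer ids
    #   'nodes'       : list of LSTM node ids
    #   'layer_edges' : set of (src, dst) connections between LSTM layers
    #   'skip_edges'  : set of (src, dst) skip connections between LSTM nodes
    if random.random() < p_layer_conn:
        # Mutation 1: enable or disable a connection between two LSTM layers.
        edge = tuple(random.sample(blueprint['layers'], 2))
        if edge in blueprint['layer_edges']:
            blueprint['layer_edges'].discard(edge)   # disable
        else:
            blueprint['layer_edges'].add(edge)       # enable
    else:
        # Mutation 2: add or remove a skip connection between two LSTM nodes.
        edge = tuple(random.sample(blueprint['nodes'], 2))
        if edge in blueprint['skip_edges']:
            blueprint['skip_edges'].discard(edge)    # remove
        else:
            blueprint['skip_edges'].add(edge)        # add
    return blueprint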
In each generation, a population of these network graphs (i.e., blueprints), con-
sisting of LSTM variants (i.e., modules with possible skip connections), is created.
The individual networks are then trained and tested with the supervised data of the
task. The experimental setup and the language modeling task are described next.
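This generational loop can be outlined roughly as follows; the callables assemble, train, evaluate, and select_and_mutate stand in for the task-specific training code and the CoDeepNEAT selection step, and are assumptions of this sketch.

def evolve(population, assemble, train, evaluate, select_and_mutate, generations=25):
    # Each generation: assemble every blueprint into an LSTM network graph,
    # train it on the task's supervised data, assign a fitness, then
    # select and mutate the population for the next generation.
    for _ in range(generations):
        fitnesses = []
        for blueprint in population:
            net = assemble(blueprint)        # build network graph from blueprint
            train(net)                       # supervised training on the task data
            fitnesses.append(evaluate(net))  # e.g., negative validation perplexity
        population = select_and_mutate(population, fitnesses)
    return population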
4.2 EVOLVING DNNs IN THE LANGUAGE MODELING BENCHMARK
One standard benchmark task for LSTM networks is language modeling, that is, pre-
dicting the next word in a large text corpus. The benchmark utilizes the Penn Tree
Bank (PTB) dataset [44], which consists of 929k training words, 73k validation
words, and 82k test words. It has 10k words in its vocabulary.
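Language models on this benchmark are scored by perplexity: the exponential of the average negative log-likelihood the model assigns to the next word. A small self-contained illustration (not tied to any particular model) is:

import math

def perplexity(log_probs):
    # log_probs holds log p(w_t | w_<t) for each word in the corpus (natural log).
    # Perplexity is the exponential of the average negative log-likelihood.
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 1/50 to every next word has perplexity 50.
print(perplexity([math.log(1 / 50)] * 1000))  # ~50.0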
A population of 50 LSTM networks was initialized with uniformly random
initial connection weights within [-0.05, 0.05]. Each network consisted of two
recurrent layers (vanilla LSTM or its variants) with 650 hidden nodes in each layer.
The network was unrolled in time up to 35 steps. The hidden states were initialized to
zero. The final hidden states of the current minibatch were used as the initial hidden
states of the subsequent minibatch (successive minibatches sequentially traverse the
training set). The size of each minibatch was 20. For fitness evaluation, each network
was trained for 39 epochs. A learning rate decay of 0.8 was applied at the end of
every six epochs; the dropout rate was 0.5. The gradients were clipped if their
maximum norm (normalized by minibatch size) exceeded 5. Training a single
network took about 200 min on a GeForce GTX 980 GPU card.
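For reference, the following PyTorch-style sketch shows a baseline two-layer LSTM language model with the hyperparameters listed above. It is an approximation for clarity, not the code used in the experiments, and any detail not stated in the text (e.g., the embedding size equal to the hidden size) is an assumption.

import torch
import torch.nn as nn

class PTBBaseline(nn.Module):
    # Two stacked LSTM layers with 650 hidden units each, dropout of 0.5,
    # and a 10k-word vocabulary, as described in the text.
    def __init__(self, vocab_size=10000, hidden=650, layers=2, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.drop = nn.Dropout(dropout)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers,
                            dropout=dropout, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch=20, seq_len=35) word indices; state carries the
        # final hidden states of the previous minibatch.
        x = self.drop(self.embed(tokens))
        out, state = self.lstm(x, state)
        return self.decoder(self.drop(out)), state

# Hyperparameters stated in the text.
SEQ_LEN, BATCH, EPOCHS = 35, 20, 39
INIT_RANGE, CLIP_NORM, LR_DECAY = 0.05, 5.0, 0.8

model = PTBBaseline()
for p in model.parameters():
    nn.init.uniform_(p, -INIT_RANGE, INIT_RANGE)   # uniform init in [-0.05, 0.05]

# During training, gradients would be clipped with
# torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP_NORM),
# and the LSTM state detached and carried over between successive minibatches.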
After 25 generations of neuroevolution, the best network improved the perfor-
mance on the PTB dataset by 5% (test-perplexity score 78) as compared to the vanilla
LSTM [45]. As shown in Fig. 15.3, this LSTM variant consists of a feedback skip
connection between the memory cells of two LSTM layers. This result is interesting
because it is similar to a recent hand-designed architecture that also outperforms va-
nilla LSTM [41].
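One way such a variant could be wired is sketched below: the memory cell of the second LSTM layer is fed back as an additional input to the first layer at the next time step. This is only an illustrative interpretation; the exact connectivity of the evolved network in Fig. 15.3 may differ.

import torch
import torch.nn as nn

class FeedbackSkipLSTM(nn.Module):
    # Two stacked LSTM layers; the second layer's cell state c2 from the
    # previous time step is concatenated onto the first layer's input.
    def __init__(self, input_size, hidden):
        super().__init__()
        self.cell1 = nn.LSTMCell(input_size + hidden, hidden)
        self.cell2 = nn.LSTMCell(hidden, hidden)

    def forward(self, inputs):
        # inputs: (seq_len, batch, input_size)
        seq_len, batch, _ = inputs.shape
        hidden = self.cell1.hidden_size
        zeros = inputs.new_zeros(batch, hidden)
        h1, c1, h2, c2 = zeros, zeros, zeros, zeros
        outputs = []
        for t in range(seq_len):
            # Feedback skip connection: previous c2 joins the layer-1 input.
            h1, c1 = self.cell1(torch.cat([inputs[t], c2], dim=1), (h1, c1))
            h2, c2 = self.cell2(h1, (h2, c2))
            outputs.append(h2)
        return torch.stack(outputs)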
The initial results thus demonstrate that CoDeepNEAT with just two LSTM-
specific mutations can automatically discover improved LSTM variants. It is likely
that expanding the search space with additional mutation, layer, and connection
types would lead to further improvements.