network design. The CoDeepNEAT method incorporates both approaches: neuroevolution searches for both new LSTM units and their connectivity across multiple layers at the same time.
CoDeepNEAT was slightly modified to make it easier to find novel connectivities between LSTM layers. Multiple LSTM layers are flattened into a neural network graph that is then modified by neuroevolution. There are two types of mutations: one enables or disables a connection between LSTM layers, and the other adds or removes skip connections between two LSTM nodes. Recently, skip connections have led to performance improvements in deep neural networks, which suggests that they could be useful for LSTM networks as well. Thus, neuroevolution modifies both the high-level network topology and the low-level LSTM connections.
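These two mutation operators are simple to express in code. The sketch below is illustrative only, assuming a minimal dictionary-based graph representation; the chapter gives no implementation, and all function and field names here are hypothetical.

import random

def mutate_layer_connection(graph, rng=random):
    # Enable or disable a connection between two LSTM layers.
    i, j = rng.sample(sorted(graph["layers"]), 2)
    edge = (min(i, j), max(i, j))
    if edge in graph["layer_edges"]:
        graph["layer_edges"].remove(edge)  # disable existing connection
    else:
        graph["layer_edges"].add(edge)     # enable a new connection

def mutate_skip_connection(graph, rng=random):
    # Add or remove a skip connection between two LSTM nodes.
    a, b = rng.sample(sorted(graph["nodes"]), 2)
    edge = (min(a, b), max(a, b))
    if edge in graph["skip_edges"]:
        graph["skip_edges"].remove(edge)   # remove an existing skip
    else:
        graph["skip_edges"].add(edge)      # add a new skip connection

graph = {"layers": {0, 1}, "nodes": {0, 1, 2, 3},
         "layer_edges": {(0, 1)}, "skip_edges": set()}
mutate_skip_connection(graph)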
In each generation, a population of these network graphs (i.e., blueprints), consisting of LSTM variants (i.e., modules with possible skip connections), is created. The individual networks are then trained and tested with the supervised data of the task. The experimental setup and the language modeling task are described next.
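The generation loop itself can be sketched as a simple select-and-mutate cycle. This is a schematic, not the authors' implementation: train_and_evaluate is a placeholder for supervised training on the task, and the mutation functions are those sketched above.

import copy
import random

def evolve(population, generations, train_and_evaluate, mutations):
    # One selection/variation cycle per generation (schematic).
    for _ in range(generations):
        # Fitness: train each assembled network on the task's
        # supervised data and score it (lower is better, e.g.,
        # validation perplexity).
        ranked = sorted(population, key=train_and_evaluate)
        survivors = ranked[: len(ranked) // 2]
        # Refill the population by cloning survivors and applying
        # one randomly chosen LSTM-specific mutation to each clone.
        children = []
        for parent in survivors:
            child = copy.deepcopy(parent)
            random.choice(mutations)(child)
            children.append(child)
        population = survivors + children
    return population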

                  4.2 EVOLVING DNNs IN THE LANGUAGE MODELING BENCHMARK

One standard benchmark task for LSTM networks is language modeling, that is, predicting the next word in a large text corpus. The benchmark utilizes the Penn Tree Bank (PTB) dataset [44], which consists of 929k training words, 73k validation words, and 82k test words. It has 10k words in its vocabulary.
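Performance on this task is measured by perplexity, the exponential of the average per-word negative log-likelihood (lower is better). A minimal illustration with made-up probabilities:

import math

def perplexity(word_probs):
    # word_probs: the model's probability for each target word.
    nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(nll)

print(perplexity([0.1, 0.02, 0.3, 0.05]))  # ~13.5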
A population of 50 LSTM networks was initialized with uniformly random initial connection weights within [-0.05, 0.05]. Each network consisted of two recurrent layers (vanilla LSTM or its variants) with 650 hidden nodes in each layer. The network was unrolled in time for up to 35 steps. The hidden states were initialized to zero. The final hidden states of the current minibatch were used as the initial hidden states of the subsequent minibatch (successive minibatches sequentially traverse the training set). The size of each minibatch was 20. For fitness evaluation, each network was trained for 39 epochs. A learning-rate decay of 0.8 was applied at the end of every six epochs; the dropout rate was 0.5. The gradients were clipped if their maximum norm (normalized by minibatch size) exceeded 5. Training a single network took about 200 min on a GeForce GTX 980 GPU card.
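The chapter does not specify a framework, but this setup maps directly onto a standard truncated-BPTT training loop. The PyTorch sketch below is an assumption-laden illustration: the initial learning rate and the dummy data generator are placeholders, and clip_grad_norm_ only approximates the minibatch-normalized clipping described above.

import torch
import torch.nn as nn

class PTBModel(nn.Module):
    def __init__(self, vocab=10000, hidden=650, layers=2, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.drop = nn.Dropout(dropout)
        self.lstm = nn.LSTM(hidden, hidden, layers,
                            dropout=dropout, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x, state=None):
        out, state = self.lstm(self.drop(self.embed(x)), state)
        return self.head(self.drop(out)), state

def batches(n=10):
    # Placeholder: random [batch=20, steps=35] minibatches; a real
    # run would stream contiguous PTB text here.
    for _ in range(n):
        x = torch.randint(0, 10000, (20, 35))
        yield x, x

model = PTBModel()
for p in model.parameters():            # weights uniform in [-0.05, 0.05]
    nn.init.uniform_(p, -0.05, 0.05)
opt = torch.optim.SGD(model.parameters(), lr=1.0)  # assumed initial rate

state = None                            # zero hidden state at the start
for epoch in range(39):
    for x, y in batches():
        logits, state = model(x, state)
        # Carry the final hidden state into the next minibatch,
        # detached so gradients do not flow across minibatches.
        state = tuple(s.detach() for s in state)
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        opt.zero_grad()
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        opt.step()
    if (epoch + 1) % 6 == 0:            # decay 0.8 every six epochs
        for group in opt.param_groups:
            group["lr"] *= 0.8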
After 25 generations of neuroevolution, the best network improved performance on the PTB dataset by 5% (test-perplexity score of 78) compared to the vanilla LSTM [45]. As shown in Fig. 15.3, this LSTM variant contains a feedback skip connection between the memory cells of the two LSTM layers. This result is interesting because the variant is similar to a recent hand-designed architecture that also outperforms vanilla LSTM [41].
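Based on the description of Fig. 15.3, one plausible form of this variant feeds the second layer's memory cell back into the first layer's input at the next time step. The sketch below is an assumed reading of the figure, not the evolved architecture itself:

import torch
import torch.nn as nn

class FeedbackSkipLSTM(nn.Module):
    # Two stacked LSTM layers; layer 2's memory cell c2 from the
    # previous step is appended to layer 1's input (assumed form).
    def __init__(self, inp, hidden):
        super().__init__()
        self.cell1 = nn.LSTMCell(inp + hidden, hidden)
        self.cell2 = nn.LSTMCell(hidden, hidden)

    def forward(self, xs, state=None):
        # xs: [steps, batch, inp]
        B, H = xs.size(1), self.cell1.hidden_size
        if state is None:
            z = xs.new_zeros(B, H)
            state = (z, z, z, z)
        h1, c1, h2, c2 = state
        outs = []
        for x in xs:
            # Feedback skip: previous step's c2 joins layer 1's input.
            h1, c1 = self.cell1(torch.cat([x, c2], dim=1), (h1, c1))
            h2, c2 = self.cell2(h1, (h2, c2))
            outs.append(h2)
        return torch.stack(outs), (h1, c1, h2, c2)

out, _ = FeedbackSkipLSTM(650, 650)(torch.randn(35, 20, 650))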
The initial results thus demonstrate that CoDeepNEAT with just two LSTM-specific mutations can automatically discover improved LSTM variants. It is likely that expanding the search space with more mutation types and layer and connection types would lead to further improvements.