                  5.1 EVOLVING DNNs FOR IMAGE CAPTIONING
                   Deep learning has recently provided state-of-the-art performance in image
                   captioning, and several diverse architectures have been suggested [47–51]. The
                   input to an image captioning system is a raw image, and the output is a text caption
                   intended to describe the contents of the image. In deep learning approaches, a
                   convolutional network is usually used to process the image, and recurrent units,
                   often LSTMs, are used to generate coherent sentences with long-range dependencies.
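                   As a concrete illustration of this encoder-decoder pattern, the sketch below
                   wires a pretrained CNN to an LSTM decoder in a Keras-style API. It is a minimal,
                   generic example: the InceptionV3 backbone, the embedding and vocabulary sizes,
                   and the choice to condition the LSTM through its initial state are illustrative
                   assumptions, not details of the evolved system described in this section.

import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000  # assumed vocabulary size
EMBED_DIM = 256     # assumed embedding/LSTM size

# Image encoder: a pretrained CNN, pooled to a fixed-length feature vector.
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
cnn.trainable = False

image_in = layers.Input(shape=(299, 299, 3), name="image")
image_embedding = layers.Dense(EMBED_DIM)(cnn(image_in))

# Text decoder: previous words are embedded and fed to an LSTM; here the image
# embedding conditions the LSTM through its initial state (one common choice).
words_in = layers.Input(shape=(None,), name="previous_words")
word_embeddings = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(words_in)
decoder_state = layers.LSTM(EMBED_DIM)(
    word_embeddings, initial_state=[image_embedding, image_embedding])

# A softmax over the vocabulary predicts the next word of the caption.
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(decoder_state)
captioner = Model([image_in, words_in], next_word)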
                     As is common in existing approaches, the evolved system uses a pretrained
                  ImageNet model [5] to produce initial image embeddings. The evolved network
                  takes an image embedding as input, along with a one-hot text input. As usual, in
                  training the text input contains the previous word of the ground truth caption; in
                  inference it contains the previous word generated by the model [47,49].
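                   The difference between these two regimes can be made concrete with two small
                   helper functions, sketched below for the captioner model above. The token IDs
                   (start_id, end_id) and the greedy decoding strategy are illustrative
                   assumptions; beam search is another common choice at inference time.

import numpy as np

def teacher_forcing_pairs(caption_ids):
    """Training: each input prefix comes from the ground-truth caption and the
    target is its next word (one training example per caption position)."""
    return [(caption_ids[:t], caption_ids[t]) for t in range(1, len(caption_ids))]

def greedy_decode(captioner, image, start_id, end_id, max_len=20):
    """Inference: the prefix is built from the model's own previous predictions."""
    words = [start_id]
    for _ in range(max_len):
        probs = captioner.predict([image, np.array([words])], verbose=0)
        next_id = int(np.argmax(probs[0]))
        if next_id == end_id:
            break
        words.append(next_id)
    return words[1:]  # drop the <start> token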
                     In the initial CoDeepNEAT population the image and text inputs are fed to a
                  shared embedding layer, which is densely connected to a softmax output over words.
                  From this simple starting point, CoDeepNEAT evolves architectures that include
                  fully connected layers, LSTM layers, sum layers, concatenation layers, and sets
                   of hyperparameters associated with each layer, along with a set of global
                   hyperparameters (Table 15.2). In particular, the well-known Show-and-Tell image captioning
                  architecture [49] is in this search space, providing a baseline with which evolution
                  results can be compared. These components and the glue that connects them are
                  evolved as described in Section 3.2, with 100 networks trained in each generation.
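                   A seed network of this kind can be written down in a few lines; the sketch
                   below is one plausible reading, since the chapter does not specify exactly how
                   the two inputs are combined before the shared embedding (concatenation is
                   assumed here), and the sizes are placeholders for the values that evolution
                   chooses.

from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000      # assumed vocabulary size
IMAGE_FEATURES = 2048   # assumed size of the pretrained ImageNet embedding
SHARED_EMBED = 256      # shared embedding size; evolved within [128, 512]

image_in = layers.Input(shape=(IMAGE_FEATURES,), name="image_embedding")
word_in = layers.Input(shape=(VOCAB_SIZE,), name="previous_word_one_hot")

# Both inputs feed a shared embedding layer (here, after concatenation),
# which is densely connected to a softmax output over the vocabulary.
shared = layers.Dense(SHARED_EMBED, name="shared_embedding")(
    layers.concatenate([image_in, word_in]))
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(shared)

seed_network = Model([image_in, word_in], next_word)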
                     Since there is no single best accepted metric for evaluating captions, the fitness
                  function is the mean across three metrics (BLEU, METEOR, and CIDEr; [46])



                  Table 15.2 Node and Global Hyperparameters Evolved for the Image
                  Captioning Case Study
                   Global Hyperparameters            Range
                   Learning rate                     [0.0001, 0.1]
                   Momentum                          [0.68, 0.99]
                   Shared embedding size             [128, 512]
                   Embedding dropout                 [0, 0.7]
                   LSTM recurrent dropout            {True, False}
                   Nesterov momentum                 {True, False}
                   Weight initialization             {Glorot normal, He normal}
                   Node Hyperparameters              Range
                   Layer type                        {Dense, LSTM}
                   Merge method                      {Sum, Concat}
                   Layer size                        {128, 256}
                   Layer activation                  {ReLU, Linear}
                   Layer dropout                     [0, 0.7]
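                   One way to picture this search space is as a small specification that an
                   evolutionary algorithm samples and refines. The sketch below simply mirrors
                   the rows of Table 15.2 in a Python dictionary and draws one random setting
                   from each; it is an illustration of the ranges, not CoDeepNEAT's actual genome
                   encoding, and treating the bracketed shared-embedding range as an integer
                   interval is an assumption.

import random

GLOBAL_SPACE = {
    "learning_rate":          ("range", 0.0001, 0.1),
    "momentum":               ("range", 0.68, 0.99),
    "shared_embedding_size":  ("irange", 128, 512),
    "embedding_dropout":      ("range", 0.0, 0.7),
    "lstm_recurrent_dropout": ("choice", [True, False]),
    "nesterov_momentum":      ("choice", [True, False]),
    "weight_initialization":  ("choice", ["glorot_normal", "he_normal"]),
}

NODE_SPACE = {
    "layer_type":       ("choice", ["dense", "lstm"]),
    "merge_method":     ("choice", ["sum", "concat"]),
    "layer_size":       ("choice", [128, 256]),
    "layer_activation": ("choice", ["relu", "linear"]),
    "layer_dropout":    ("range", 0.0, 0.7),
}

def sample(space):
    """Draw one random setting from each hyperparameter's range or choice set."""
    out = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "range":
            out[name] = random.uniform(spec[1], spec[2])
        elif kind == "irange":
            out[name] = random.randint(spec[1], spec[2])
        else:  # "choice"
            out[name] = random.choice(spec[1])
    return out

global_hparams = sample(GLOBAL_SPACE)
node_hparams = sample(NODE_SPACE)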