5. Application Case Study: Image Captioning for the Blind
5.1 EVOLVING DNNs FOR IMAGE CAPTIONING
Deep learning has recently provided state-of-the-art performance in image
captioning, and several diverse architectures have been suggested [47–51]. The
input to an image captioning system is a raw image, and the output is a text caption
intended to describe the contents of the image. In deep learning approaches, a
convolutional network is usually used to process the image, and recurrent units,
often LSTMs, are used to generate coherent sentences with long-range dependencies.
As is common in existing approaches, the evolved system uses a pretrained
ImageNet model [5] to produce initial image embeddings. The evolved network
takes an image embedding as input, along with a one-hot text input. As usual, during
training the text input contains the previous word of the ground-truth caption;
during inference it contains the previous word generated by the model [47,49].
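To make this training/inference distinction concrete, here is a minimal Python sketch (not from the chapter; `model_predict` and the `<start>`/`<end>` tokens are hypothetical placeholders):

```python
def training_pairs(caption_tokens):
    """Teacher forcing: at step t the text input is ground-truth word t-1,
    and the training target is ground-truth word t."""
    inputs = ["<start>"] + caption_tokens[:-1]  # previous ground-truth word
    targets = caption_tokens                    # word to be predicted
    return list(zip(inputs, targets))

def generate(model_predict, image_embedding, max_len=20):
    """Inference: the text input is the word the model itself generated last."""
    word, caption = "<start>", []
    for _ in range(max_len):
        word = model_predict(image_embedding, word)  # hypothetical: returns next word
        if word == "<end>":
            break
        caption.append(word)
    return caption
```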
In the initial CoDeepNEAT population the image and text inputs are fed to a
shared embedding layer, which is densely connected to a softmax output over words.
From this simple starting point, CoDeepNEAT evolves architectures that include
fully connected layers, LSTM layers, sum layers, concatenation layers, and sets
of hyperparameters associated with each layer, along with a set of global
hyperparameters (Table 15.2). In particular, the well-known Show-and-Tell image captioning
architecture [49] is in this search space, providing a baseline with which evolution
results can be compared. These components and the glue that connects them are
evolved as described in Section 3.2, with 100 networks trained in each generation.
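One plausible reading of the simple starting point is sketched below in Keras; the layer sizes, the ReLU activation, and merging the two inputs by concatenation before the shared embedding are assumptions for illustration, not the chapter's exact configuration:

```python
from tensorflow.keras import layers, Model

VOCAB_SIZE = 10000   # assumed vocabulary size
IMG_DIM = 2048       # assumed size of the pretrained ImageNet image embedding
SHARED_DIM = 256     # shared embedding size (evolved within [128, 512])

image_in = layers.Input(shape=(IMG_DIM,), name="image_embedding")
word_in = layers.Input(shape=(VOCAB_SIZE,), name="one_hot_previous_word")

# Both inputs feed a shared embedding layer (merged by concatenation here).
shared = layers.Dense(SHARED_DIM, activation="relu", name="shared_embedding")(
    layers.concatenate([image_in, word_in])
)

# The shared embedding is densely connected to a softmax output over words.
next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(shared)

seed_model = Model([image_in, word_in], next_word)
seed_model.compile(optimizer="sgd", loss="categorical_crossentropy")
```

From this seed, evolution can add LSTM layers, further dense layers, and sum or concatenation merges to build deeper captioning architectures.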
Since no single metric is accepted as the best for evaluating captions, the fitness
function is the mean of three metrics (BLEU, METEOR, and CIDEr [46]).
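A sketch of such a fitness function is given below, assuming the scorers from the pycocoevalcap package and taking BLEU-4 as the BLEU component; the chapter specifies neither the implementation nor the BLEU order:

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

def caption_fitness(references, candidates):
    """Mean of BLEU, METEOR, and CIDEr for one candidate network's captions.

    `references` and `candidates` map image ids to lists of caption strings,
    the input format pycocoevalcap expects.
    """
    bleu, _ = Bleu(4).compute_score(references, candidates)  # BLEU-1..4
    meteor, _ = Meteor().compute_score(references, candidates)
    cider, _ = Cider().compute_score(references, candidates)
    return (bleu[3] + meteor + cider) / 3.0  # BLEU-4 chosen here; an assumption
```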
Table 15.2 Node and Global Hyperparameters Evolved for the Image Captioning Case Study

Global Hyperparameters        Range
Learning rate                 [0.0001, 0.1]
Momentum                      [0.68, 0.99]
Shared embedding size         [128, 512]
Embedding dropout             [0, 0.7]
LSTM recurrent dropout        {True, False}
Nesterov momentum             {True, False}
Weight initialization         {Glorot normal, He normal}

Node Hyperparameters          Range
Layer type                    {Dense, LSTM}
Merge method                  {Sum, Concat}
Layer size                    {128, 256}
Layer activation              {ReLU, Linear}
Layer dropout                 [0, 0.7]
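To make the search space concrete, the sketch below re-expresses Table 15.2 as ranges from which an initial population could sample. The dictionary representation and the choice of log-uniform sampling for the learning rate are assumptions for illustration, not CoDeepNEAT's internal encoding:

```python
import math
import random

GLOBAL_SPACE = {
    "learning_rate":          ("log_uniform", 1e-4, 1e-1),  # log scale is an assumption
    "momentum":               ("uniform", 0.68, 0.99),
    "shared_embedding_size":  ("int_range", 128, 512),
    "embedding_dropout":      ("uniform", 0.0, 0.7),
    "lstm_recurrent_dropout": ("choice", (True, False)),
    "nesterov_momentum":      ("choice", (True, False)),
    "weight_initialization":  ("choice", ("glorot_normal", "he_normal")),
}

NODE_SPACE = {
    "layer_type":       ("choice", ("Dense", "LSTM")),
    "merge_method":     ("choice", ("sum", "concat")),
    "layer_size":       ("choice", (128, 256)),
    "layer_activation": ("choice", ("relu", "linear")),
    "layer_dropout":    ("uniform", 0.0, 0.7),
}

def sample(space):
    """Draw one hyperparameter setting from a space like the ones above."""
    setting = {}
    for name, (kind, *args) in space.items():
        if kind == "uniform":
            setting[name] = random.uniform(args[0], args[1])
        elif kind == "log_uniform":
            lo, hi = math.log10(args[0]), math.log10(args[1])
            setting[name] = 10 ** random.uniform(lo, hi)
        elif kind == "int_range":
            setting[name] = random.randint(args[0], args[1])
        else:  # "choice"
            setting[name] = random.choice(args[0])
    return setting

# Example: one global and one per-node setting for a new candidate network.
globals_ = sample(GLOBAL_SPACE)
node = sample(NODE_SPACE)
```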