normalized by their baseline values. Fitness is computed over a holdout set of 5000
images, that is, 25,000 image-caption pairs.
To keep the computational cost reasonable, during evolution the networks are
trained for only six epochs, and only on a random subset of 100,000 of the
500,000 MSCOCO image-caption pairs. As a result, there is evolutionary pressure
toward networks that converge quickly: The best resulting architectures train to
near convergence six times faster than the baseline Show-and-Tell model [49]. After
evolution, the optimized learning rate is scaled by one-fifth to compensate for the
subsampling.
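This evaluation scheme amounts to a short training run followed by a normalized holdout score. The sketch below is a minimal Python rendering under assumed interfaces; build_model, fit, and score are hypothetical placeholders, not the authors' actual code:

```python
import random

def evaluate_candidate(build_model, train_pairs, holdout_pairs,
                       baseline_score, subset_size=100_000, epochs=6):
    """Train an evolved candidate briefly, then score it on the fixed
    holdout set (5000 images, i.e., 25,000 image-caption pairs).

    build_model, .fit, and .score are hypothetical placeholders for
    whatever training framework is actually used.
    """
    model = build_model()  # fresh network instantiated from the evolved genotype
    # Random 100,000-pair subset of the 500,000 MSCOCO image-caption pairs.
    subset = random.sample(train_pairs, min(subset_size, len(train_pairs)))
    for _ in range(epochs):  # only six epochs, which rewards fast convergence
        model.fit(subset)
    # Fitness is the holdout metric normalized by the baseline model's value.
    return model.score(holdout_pairs) / baseline_score
```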
5.2 BUILDING THE APPLICATION
The images in MSCOCO are chosen to depict “common objects in context.” The
focus is on a relatively small set of objects and their interactions in a relatively small
set of settings. The Internet as a whole, and the online magazine website in particular,
contain many images that cannot be classified as “common objects in context.”
Other types of images from the magazine include staged portraits of people,
infographics, cartoons, abstract designs, and iconic images, that is, images of one or
multiple objects out of context, such as on a white or patterned background. Therefore,
an additional dataset of 17,000 image-caption pairs was constructed for the
case study, targeting iconic images in particular. Four thousand images were first
scraped from the magazine website, and 1000 of them were identified as iconic.
Then, 16,000 images that were visually similar to those 1000 were retrieved
automatically from a large image repository. A single ground-truth caption for each of
these 17,000 images was generated by human subjects through MightyAI.¹ The holdout
set for evaluation consisted of 100 of the original 1000 iconic images, along with
3000 other images.
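The visual-similarity retrieval step amounts to a nearest-neighbor search over image embeddings. The sketch below is one plausible implementation, assuming embeddings have already been computed for all images; the function names and the use of cosine similarity are assumptions for illustration, not a description of the actual pipeline:

```python
import numpy as np

def retrieve_similar(seed_embs, repo_embs, repo_ids, k_total=16_000):
    """Return the IDs of the k_total repository images most similar to
    any of the seed (iconic) images, by cosine similarity of embeddings."""
    # Normalize rows so that a dot product equals cosine similarity.
    seeds = seed_embs / np.linalg.norm(seed_embs, axis=1, keepdims=True)
    repo = repo_embs / np.linalg.norm(repo_embs, axis=1, keepdims=True)
    # Score each repository image by its best match among the 1000 seeds.
    best_sim = (repo @ seeds.T).max(axis=1)
    top = np.argsort(-best_sim)[:k_total]
    return [repo_ids[i] for i in top]
```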
During evolution, networks were trained and evaluated only on the MSCOCO
data. The best architecture from evolution was then trained from scratch on both
the MSCOCO and MightyAI datasets in an iterative, alternating approach: one epoch
on MSCOCO, followed by five epochs on MightyAI, until maximum performance
was reached on the MightyAI holdout data. Beam search was then used to generate
captions from the fully trained models. The performance achieved on the MightyAI
data demonstrates the ability of evolved architectures to generalize to domains for
which they were not evolved.
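The alternating schedule can be written down directly. As before, the model interface here is a hypothetical placeholder; the loop simply encodes the one-epoch/five-epoch alternation with early stopping on the MightyAI holdout score:

```python
def alternating_finetune(model, mscoco_pairs, mightyai_pairs, holdout_pairs,
                         max_rounds=100):
    """Alternate one MSCOCO epoch with five MightyAI epochs until the
    MightyAI holdout score stops improving (model.fit and model.score
    are hypothetical placeholders)."""
    best_score = float("-inf")
    for _ in range(max_rounds):
        model.fit(mscoco_pairs, epochs=1)    # retain general captioning ability
        model.fit(mightyai_pairs, epochs=5)  # adapt to the magazine's iconic images
        score = model.score(holdout_pairs)
        if score <= best_score:              # maximum holdout performance reached
            break
        best_score = score
    return model
```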
Once the model was fully trained, it was placed on a server where it can be
queried with images to caption. A JavaScript snippet was written that developers
can embed in their sites to automatically query the model to caption all images
on a page. This snippet runs in an existing Chrome extension for custom scripts
and automatically captions images as the user browses the web. These tools add
¹ https://mty.ai/computer-vision/.
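For illustration, a client query to such a captioning server might look like the following Python sketch; the endpoint URL and JSON payload format are assumptions, not the actual service interface:

```python
import requests

# Hypothetical endpoint for the deployed captioning model.
CAPTION_ENDPOINT = "https://captioning.example.com/caption"

def caption_image(image_url):
    """Send one image URL to the server and return the generated caption."""
    resp = requests.post(CAPTION_ENDPOINT,
                         json={"image_url": image_url}, timeout=30)
    resp.raise_for_status()
    return resp.json()["caption"]
```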