Page 311 - Artificial Intelligence in the Age of Neural Networks and Brain Computing

304    CHAPTER 15 Evolving Deep Neural Networks




                         normalized by their baseline values. Fitness is computed over a holdout set of 5000
                         images, that is, 25,000 image-caption pairs.
                            To keep the computational cost reasonable, during evolution the networks are
                         trained for only six epochs, and only with a random 100,000 image subset of the
                         500,000 MSCOCO image-caption pairs. As a result, there is evolutionary pressure
                         towards networks that converge quickly: The best resulting architectures train to
                         near convergence six times faster than the baseline Show-and-Tell model [49]. After
                         evolution, the optimized learning rate is scaled by one-fifth to compensate for the
                         subsampling.
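
                            The reduced training budget can be illustrated with a small sketch. It assumes
                         that pairs are sampled uniformly at random and that "scaled by one-fifth" means
                         multiplying the evolved learning rate by 1/5, which matches the 100,000/500,000
                         subsampling ratio. The function names, the seed, and the example learning rate
                         are hypothetical and do not come from the chapter.

```python
import random

TOTAL_PAIRS = 500_000      # MSCOCO image-caption pairs (from the text)
SUBSET_SIZE = 100_000      # random subset used during evolution (from the text)
EVOLUTION_EPOCHS = 6       # epochs per candidate network (from the text)


def evolution_subset(num_pairs, subset_size, seed=0):
    """Pick a random subset of pair indices without replacement.

    Uniform sampling and the fixed seed are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_pairs), subset_size))


def final_learning_rate(evolved_lr, scale=SUBSET_SIZE / TOTAL_PAIRS):
    """Scale the evolved learning rate for full-data training.

    Assumption: "scaled by one-fifth" is read as multiplication by 1/5.
    """
    return evolved_lr * scale


if __name__ == "__main__":
    indices = evolution_subset(TOTAL_PAIRS, SUBSET_SIZE)
    print(len(indices), "pairs used during evolution")
    print("learning rate after evolution:", final_learning_rate(0.01))
```

                            Under this reading, the scale factor equals the fraction of the data seen during
                         evolution, so a network tuned on the subset takes proportionally smaller steps when
                         it is later trained on the full dataset.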

                         5.2 BUILDING THE APPLICATION

                         The images in MSCOCO are chosen to depict “common objects in context.” The
                         focus is on a relatively small set of objects and their interactions in a relatively small
                         set of settings. The Internet as a whole, and the online magazine website in particular,
                         contain many images that cannot be classified as “common objects in context.”
                         Other types of images from the magazine include staged portraits of people, infographics,
                         cartoons, abstract designs, and iconic images, that is, images of one or
                         multiple objects out of context such as on a white or patterned background. Therefore,
                         an additional dataset of 17,000 image-caption pairs was constructed for the
                         case study, targeting iconic images in particular. Four thousand images were first
                         scraped from the magazine website, and 1000 of them were identified as iconic.
                         Then, 16,000 images that were visually similar to those 1000 were retrieved automatically
                         from a large image repository. A single ground-truth caption for each of
                         these 17,000 images was generated by human subjects through MightyAI.¹ The holdout
                         set for evaluation consisted of 100 of the original 1000 iconic images, along with
                         3000 other images.
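
                            The chapter does not say how visual similarity was measured or how many
                         neighbors were retrieved per iconic image. As one possible illustration only, the
                         sketch below ranks repository images by cosine similarity between feature vectors
                         and collects the closest unseen ones for each seed image. The feature vectors, the
                         function names, and the per-seed quota are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_similar(seed_vectors, repository, per_seed):
    """Return ids of repository images closest to each seed, without duplicates.

    seed_vectors: feature vectors of the hand-picked iconic images (hypothetical).
    repository:   dict mapping image id -> feature vector (hypothetical).
    per_seed:     how many new images to take per seed (an assumed quota).
    """
    chosen, seen = [], set()
    for seed in seed_vectors:
        ranked = sorted(
            repository,
            key=lambda image_id: cosine(seed, repository[image_id]),
            reverse=True,
        )
        picked = 0
        for image_id in ranked:
            if picked == per_seed:
                break
            if image_id not in seen:
                seen.add(image_id)
                chosen.append(image_id)
                picked += 1
    return chosen


if __name__ == "__main__":
    repo = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
    print(retrieve_similar([[1.0, 0.0], [0.0, 1.0]], repo, per_seed=1))
```

                            With 1000 seed images, a quota of 16 per seed would yield the 16,000 retrieved
                         images mentioned above, although the actual procedure may have differed.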
                            During evolution, networks were trained and evaluated only on the MSCOCO
                         data. The best architecture from evolution was then trained from scratch on both
                         the MSCOCO and MightyAI datasets in an iterative alternating approach: one epoch
                         on MSCOCO, followed by five epochs on MightyAI, until maximum performance
                         was reached on the MightyAI holdout data. Beam search was then used to generate
                         captions from the fully trained models. Performance achieved using the MightyAI
                         data demonstrates the ability of evolved architectures to generalize to domains
                         for which they were not evolved.
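
                            The alternating schedule described above can be sketched as a simple loop. The
                         one-epoch and five-epoch counts come from the text. The callbacks train_epoch and
                         evaluate, the dataset labels, and the patience-based stopping rule are hypothetical
                         stand-ins, since the chapter only states that training continued until maximum
                         performance was reached on the MightyAI holdout data.

```python
def alternating_training(train_epoch, evaluate, max_rounds=50, patience=3,
                         coco_epochs=1, mighty_epochs=5):
    """Alternate between two datasets and stop when the holdout score stalls.

    train_epoch(name): trains the model for one epoch on the named dataset.
    evaluate(name):    returns a score (higher is better) on the named split.
    patience:          rounds without improvement before stopping (assumed rule).
    Returns (best_round, best_score).
    """
    best_score = float("-inf")
    best_round = 0
    rounds_without_gain = 0
    for round_idx in range(1, max_rounds + 1):
        for _ in range(coco_epochs):          # one epoch on MSCOCO
            train_epoch("mscoco")
        for _ in range(mighty_epochs):        # five epochs on MightyAI
            train_epoch("mightyai")
        score = evaluate("mightyai_holdout")
        if score > best_score:
            best_score, best_round = score, round_idx
            rounds_without_gain = 0
        else:
            rounds_without_gain += 1
            if rounds_without_gain >= patience:
                break
    return best_round, best_score


if __name__ == "__main__":
    # Toy stand-ins: record the epoch order and replay a fixed score curve.
    log = []
    curve = iter([0.2, 0.5, 0.4, 0.3])
    result = alternating_training(log.append, lambda split: next(curve), patience=2)
    print(result, len(log), "epochs run")
```

                            In practice the model weights from the best round would be checkpointed and
                         restored before caption generation; the sketch only tracks which round that was.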
                            Once the model was fully trained, it was placed on a server where it can be
                         queried with images to caption. A JavaScript snippet was written that a developer
                         can embed in their site to automatically query the model and caption all images
                         on a page. The same snippet runs in an existing Chrome extension for custom scripts,
                         automatically captioning images as the user browses the web. These tools add




                         ¹ https://mty.ai/computer-vision/.