
FIGURE 15.4
An iconic image from an online magazine captioned by an evolved model. The model provides a suitably detailed description without any unnecessary context.

The generated captions can be added to the "alt" field of images, which screen readers can then read to blind Internet users (Fig. 15.4).
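
As an illustration of how such a pipeline could hook into a web page, the following sketch fills in missing "alt" attributes with model-generated captions. It is not code from this work: generate_caption is a hypothetical stand-in for the evolved captioning network, and BeautifulSoup is used here only as one convenient HTML parser.

    from bs4 import BeautifulSoup

    def generate_caption(image_url):
        """Hypothetical stand-in for running the evolved captioning model."""
        raise NotImplementedError

    def add_alt_captions(html):
        soup = BeautifulSoup(html, "html.parser")
        for img in soup.find_all("img"):
            # Only fill in images that lack meaningful alt text.
            if not img.get("alt"):
                img["alt"] = generate_caption(img.get("src", ""))
        return str(soup)

In practice such a tool could run as a browser extension or a server-side filter; either way, the captioning model itself is the only nontrivial component.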


5.3 IMAGE CAPTIONING RESULTS
With training parallelized across about 100 GPUs, each generation took around 1 h to complete. The fittest architecture was discovered in generation 37 (Fig. 15.5). This architecture performs better than the hand-tuned baseline [49] when trained on the MSCOCO data alone (Table 15.3).
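
The timing above follows from evaluating one candidate per GPU, so a whole generation takes roughly as long as a single training run. The following minimal sketch shows this structure; train_and_score and mutate are hypothetical placeholders for whatever training procedure and variation operators the system actually uses.

    from concurrent.futures import ProcessPoolExecutor
    import random

    def train_and_score(architecture):
        """Train one candidate network and return its validation fitness (placeholder)."""
        raise NotImplementedError

    def mutate(architecture):
        """Produce a varied child architecture (placeholder)."""
        raise NotImplementedError

    def next_generation(population, n_workers=100):
        # One candidate per worker/GPU: the generation finishes in roughly
        # the time of a single training run (~1 h here).
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            scores = list(pool.map(train_and_score, population))
        ranked = [arch for _, arch in sorted(zip(scores, population),
                                             key=lambda pair: pair[0],
                                             reverse=True)]
        parents = ranked[:max(1, len(ranked) // 2)]   # truncation selection
        return [mutate(random.choice(parents)) for _ in population]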
However, a more important result is the performance of this network on the magazine website. Because no suitable automatic metrics exist for the types of captions collected for the magazine website (and existing metrics are very noisy when there is only one reference caption), the captions generated by the evolved model on all 3100 holdout images were manually rated as correct, mostly correct, mostly incorrect, or incorrect (Fig. 15.6). Fig. 15.7 shows examples of good and bad captions for these images.
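
For reference, the percentages reported below reduce the four-way manual ratings to a single "correct or mostly correct" fraction, as in this small sketch (the ratings list shown is illustrative, not the study's data):

    from collections import Counter

    def summarize(ratings):
        """ratings: one of 'correct', 'mostly correct', 'mostly incorrect',
        or 'incorrect' per holdout image; returns the percentage rated
        correct or mostly correct."""
        counts = Counter(ratings)
        acceptable = counts["correct"] + counts["mostly correct"]
        return 100.0 * acceptable / len(ratings)

    print(summarize(["correct", "mostly correct",
                     "mostly incorrect", "incorrect"]))  # -> 50.0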
The model is not perfect, but the results are promising. The evolved network is correct or mostly correct on 63% of iconic images and 31% of all images. Many known improvements can still be implemented, including ensembling diverse architectures generated by evolution, fine-tuning the ImageNet model, using a more recent ImageNet model, and performing beam search or scheduled sampling during training [54] (preliminary experiments with ensembling alone suggest improvements of about 20%). For this application, it is also important to include methods for automatically evaluating caption quality and for filtering captions that would give an incorrect impression to a blind user. However, even without these additions, the results demonstrate that it is now possible to develop practical applications by evolving DNNs.
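
As one example of the improvements listed above, ensembling several evolved captioners can be sketched as averaging their per-token output distributions at each decoding step. The step and init_state methods below are hypothetical APIs assumed for illustration, not part of the described system.

    import numpy as np

    def ensemble_step(models, states, prev_token):
        """Average next-token distributions across ensemble members."""
        probs, new_states = [], []
        for model, state in zip(models, states):
            p, s = model.step(prev_token, state)   # hypothetical per-model API
            probs.append(p)
            new_states.append(s)
        return np.mean(probs, axis=0), new_states

    def ensemble_decode(models, image, max_len=20, bos=1, eos=0):
        states = [m.init_state(image) for m in models]  # hypothetical
        tokens = [bos]
        for _ in range(max_len):
            p, states = ensemble_step(models, states, tokens[-1])
            tokens.append(int(np.argmax(p)))  # greedy; beam search would extend this
            if tokens[-1] == eos:
                break
        return tokens

Averaging distributions (rather than picking one model's output) lets diverse architectures correct each other's token-level mistakes, which is consistent with the ~20% improvement seen in the preliminary ensembling experiments.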