Page 312 - Artificial Intelligence in the Age of Neural Networks and Brain Computing
5. Application Case Study: Image Captioning for the Blind
FIGURE 15.4
An iconic image from an online magazine captioned by an evolved model. The model
provides a suitably detailed description without any unnecessary context.
captions to the “alt” field of images, which screen readers can then read to blind
Internet users (Fig. 15.4).
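As a rough illustration of what writing a caption into the "alt" field involves, the sketch below builds an image tag from a generated caption. The function name img_tag_with_alt and the example file name and caption are hypothetical; the chapter does not describe how the deployed system modified the magazine's pages.

```python
from html import escape


def img_tag_with_alt(src: str, caption: str) -> str:
    """Return an <img> tag whose alt attribute carries the generated caption."""
    # escape(..., quote=True) also escapes quotation marks, so a caption that
    # contains quotes cannot terminate the attribute value early.
    safe_src = escape(src, quote=True)
    safe_caption = escape(caption, quote=True)
    return f'<img src="{safe_src}" alt="{safe_caption}">'


# Made-up example values, not taken from the magazine website.
print(img_tag_with_alt("cover.jpg", "a man in a suit standing at a podium"))
```

Escaping matters here because generated captions are free text and may contain characters that are special in HTML.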
5.3 IMAGE CAPTIONING RESULTS
With training run in parallel on about 100 GPUs, each generation took around 1 h to
complete. The fittest architecture was discovered in generation 37 (Fig. 15.5). This
architecture performs better than the hand-tuned baseline [49] when trained on the
MSCOCO data alone (Table 15.3).
However, a more important result is the performance of this network on the
magazine website. Because no suitable automatic metrics exist for the types of
captions collected for the magazine website (and existing metrics are very noisy when
there is only one reference caption), captions generated by the evolved model on all
3100 holdout images were manually rated as correct, mostly correct, mostly
incorrect, or incorrect (Fig. 15.6). Fig. 15.7 shows some examples of good and
bad captions for these images.
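A four-way manual rating of this kind reduces to a simple tally. The following is a minimal sketch, assuming each holdout image receives exactly one label; the function name summarize_ratings and the toy labels are illustrative and are not the chapter's data or tooling.

```python
from collections import Counter

CATEGORIES = ("correct", "mostly correct", "mostly incorrect", "incorrect")


def summarize_ratings(ratings):
    """Return per-category fractions and the 'correct or mostly correct' share."""
    counts = Counter(ratings)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected rating labels: {sorted(unknown)}")
    total = len(ratings)
    if total == 0:
        raise ValueError("no ratings to summarize")
    fractions = {c: counts[c] / total for c in CATEGORIES}
    acceptable = fractions["correct"] + fractions["mostly correct"]
    return fractions, acceptable


# Made-up ratings for illustration only.
toy = ["correct", "mostly correct", "incorrect", "mostly incorrect", "incorrect"]
fractions, acceptable = summarize_ratings(toy)
print(fractions)
print(f"correct or mostly correct: {acceptable:.0%}")
```

Computing the same summary separately for iconic images and for all images would give the two headline percentages a report like this one quotes.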
The model is not perfect, but the results are promising. The evolved network is
correct or mostly correct on 63% of iconic images and 31% of all images. There are
many known improvements that can be implemented, including ensembling diverse
architectures generated by evolution, fine-tuning the ImageNet model, using a
more recent ImageNet model, and performing beam search or scheduled sampling
during training [54] (preliminary experiments with ensembling alone suggest
improvements of about 20%). For this application, it is also important to include
methods for automatically evaluating caption quality and filtering out captions that
would give an incorrect impression to a blind user. However, even without these
additions, the results demonstrate that it is now possible to develop practical
applications through evolving DNNs.
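Of the improvements listed, beam search is the easiest to show in a few lines. The sketch below is a generic version of the algorithm as it is commonly applied when decoding a caption, not the chapter's implementation. The callback next_token_probs and the table TOY_MODEL are hypothetical stand-ins for a trained captioning network.

```python
import math


def beam_search(next_token_probs, beam_width=3, max_len=10, end_token="<end>"):
    """Return the highest-scoring token sequence found by beam search.

    next_token_probs(prefix) is a hypothetical stand-in for a captioning model:
    it takes a tuple of tokens and returns a dict of next token -> probability.
    """
    beams = [((), 0.0)]  # (tokens so far, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in next_token_probs(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + (token,), score + math.log(prob)))
        if not candidates:
            break
        # Keep only the beam_width best partial captions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    # Fall back to unfinished beams if nothing reached the end token.
    best_tokens, _ = max(finished or beams, key=lambda c: c[1])
    return list(best_tokens)


# Toy next-token table, invented for illustration.
TOY_MODEL = {
    (): {"a": 0.6, "the": 0.4},
    ("a",): {"dog": 0.5, "cat": 0.5},
    ("the",): {"dog": 0.9, "cat": 0.1},
    ("a", "dog"): {"<end>": 1.0},
    ("a", "cat"): {"<end>": 1.0},
    ("the", "dog"): {"<end>": 1.0},
    ("the", "cat"): {"<end>": 1.0},
}

print(beam_search(lambda prefix: TOY_MODEL[prefix], beam_width=1))
print(beam_search(lambda prefix: TOY_MODEL[prefix], beam_width=2))
```

With a beam width of 1 the search is greedy and commits to the locally most likely first word. A wider beam keeps the alternative alive and finds the caption with the higher overall probability. This sketch does not normalize scores by length, so on a real model it would tend to favor shorter captions.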