Deep Visual-Semantic Alignments for Generating Image Descriptions

The approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. The alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding.
The alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K and MSCOCO datasets, and the generated descriptions significantly outperform retrieval baselines both on full images and on a new dataset of region-level annotations.
The alignment model learns to associate images and snippets of text. For an image, the model retrieves the most compatible sentence and grounds its pieces in the image. The grounding is shown as a line to the center of the corresponding bounding box. Each box has a single but arbitrary color.
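
The retrieval and grounding steps described in the caption can be illustrated with a minimal sketch. It assumes that image regions and sentence words have already been mapped into a shared embedding space, and that an image-sentence pair is scored by summing, over the words, each word's best inner product with any region. That scoring rule, the function names, and the toy vectors are assumptions made for illustration; the text above does not specify them.

```python
# Hypothetical sketch: function names, scoring rule, and toy vectors are
# illustrative assumptions, not taken from the text above.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def image_sentence_score(regions, words):
    """Assumed score: each word contributes its best similarity to any region."""
    return sum(max(dot(r, w) for r in regions) for w in words)

def retrieve_sentence(regions, sentences):
    """Return the index of the sentence most compatible with the image."""
    return max(range(len(sentences)),
               key=lambda k: image_sentence_score(regions, sentences[k]))

def ground_words(regions, words):
    """For each word, return the index of its best-matching region."""
    return [max(range(len(regions)), key=lambda i: dot(regions[i], w))
            for w in words]

# Toy data: two region embeddings and two candidate sentences,
# each sentence given as a list of word embeddings.
regions = [[1.0, 0.0], [0.0, 1.0]]
sentences = [
    [[0.9, 0.1], [0.2, 0.8]],    # sentence 0
    [[-0.5, -0.5], [0.1, 0.0]],  # sentence 1
]

best = retrieve_sentence(regions, sentences)
grounding = ground_words(regions, sentences[best])
print(best, grounding)
```

In this toy setup, each entry of grounding is the region index that a visualization would draw a line to from the corresponding word.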
