
Notes on SITTA: A Semantic Image-Text Alignment for Image Captioning

This is a summary of an important research paper. It was made interactively by a human and several AIs. The goal is to curate good ideas and provide a 10:1 time savings.

4 min read

Link to paper: https://arxiv.org/abs/2307.05591

Paper published on: 2023-07-10

Paper's authors: Fabian Paischer, Thomas Adler, Markus Hofmarcher, Sepp Hochreiter

GPT3 API Cost: $1.12

GPT4 API Cost: $0.18

Total Cost To Write This: $1.30

Time Savings: 33:1

Imagine two people who speak different languages and are trying to communicate. They have a shared understanding of the world but lack a common language to express it. This is similar to the problem addressed by the research paper introducing SITTA, a Semantic Image-Text Alignment for Image Captioning. Here, the two "people" are pretrained models, one for language and one for vision, and the "shared understanding" is the semantic content of images and text. The research team's task was to construct a "translator" or mapping that could effectively communicate between these two models.

The research team used two methods to build this translator. The first aligned the language model's embedding space with the vision model's embedding space using token correspondences, like using a bilingual dictionary to translate. The second method used additional image-text pairs to construct the mapping directly from vision to language space, akin to learning a language through immersion in a foreign country.
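The "bilingual dictionary" idea can be sketched in a few lines. Here the two encoders are random stubs standing in for the real vision-text encoder and the language model's token embeddings (both placeholders, not the paper's actual models), but the pair-then-fit structure is the same:

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_l = 32, 48  # toy embedding dimensions

# Hypothetical stand-in encoders (SITTA uses real CLIP and Llama embeddings)
def vision_text_encode(token):
    return rng.standard_normal(d_v)

def lm_token_embed(token):
    return rng.standard_normal(d_l)

# Method 1, lexical matching: tokens present in both vocabularies act as a
# "bilingual dictionary" of correspondences between the two spaces.
shared_tokens = ["dog", "tree", "car"]
pairs = [(vision_text_encode(t), lm_token_embed(t)) for t in shared_tokens]

# Stack the pairs and fit a linear map W so that X @ W approximates Y.
X = np.stack([p[0] for p in pairs])
Y = np.stack([p[1] for p in pairs])
W = np.linalg.lstsq(X, Y, rcond=None)[0]
print(W.shape)  # (32, 48)
```

Method 2 follows the same fitting step, but the pairs come from an external image-caption dataset instead of shared vocabulary tokens.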

The brilliance of SITTA lies in its simplicity and efficiency. Unlike most methods that require backpropagation of gradients through the language model for image captioning, SITTA uses mappings that are computed beforehand. This makes the method computationally light and accessible even for institutions with limited resources.

Now, let's delve into the specifics. The research team compared four linear mapping methods: Ordinary Least Squares (OLS), Ridge, Procrustes, and RobProc. Procrustes emerged as the most effective, capable of identifying shared symmetries between the embedding spaces, akin to finding common patterns in different languages. However, when trained on the MS-COCO dataset's training split, OLS yielded the highest Normalized Discounted Cumulative Gain (NDCG), a measure of ranking quality, on average. Yet, Procrustes outperformed OLS on smaller datasets, proving its effectiveness in resource-constrained scenarios.
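The difference between OLS and Procrustes comes down to whether the mapping is constrained to be orthogonal. A minimal sketch on synthetic data (toy dimensions, not the paper's actual embedding sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 64  # toy sizes

# Synthetic paired embeddings: language space is a rotation of vision space plus noise
Q_true, _ = np.linalg.qr(rng.standard_normal((d, d)))
V = rng.standard_normal((n, d))
L = V @ Q_true + 0.01 * rng.standard_normal((n, d))

# OLS: unconstrained least squares, W = argmin_W ||V W - L||_F
W_ols = np.linalg.lstsq(V, L, rcond=None)[0]

# Procrustes: restrict W to orthogonal matrices; closed form via SVD of V^T L
U, _, Vt = np.linalg.svd(V.T @ L)
W_proc = U @ Vt

print(np.allclose(W_proc.T @ W_proc, np.eye(d), atol=1e-6))  # True: orthogonal
```

Because orthogonal maps are a strict subset of all linear maps, OLS always achieves an equal or lower training residual; Procrustes trades raw fit for a structure-preserving constraint, which is why it holds up better when training data is scarce.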

For image captioning, the team combined their semantic mapping with a generative language model, specifically the 7B version of Llama. This combination, SITTA, was compared to existing methods that transfer a pretrained language model to vision-language tasks in a zero-shot manner, meaning without additional training on the specific task. SITTA showed impressive performance on the MS-COCO benchmark, outperforming other zero-shot methods and even some that fine-tune on image captions.
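Conceptually, captioning then proceeds by projecting the image embedding into language space, retrieving the nearest tokens, and handing them to the language model as a prompt. A rough sketch with stand-in embeddings (the vocabulary, dimensions, and mapping here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
vocab = ["dog", "ball", "park", "sky", "cat", "tree"]
d = 16

# Hypothetical stand-ins: LM token embeddings and a pretrained semantic mapping W
lm_emb = rng.standard_normal((len(vocab), d))
W = rng.standard_normal((d, d))

def top_k_tokens(image_emb, k=4):
    """Project an image embedding into language space; return the k nearest tokens."""
    z = image_emb @ W
    z = z / np.linalg.norm(z)
    E = lm_emb / np.linalg.norm(lm_emb, axis=1, keepdims=True)
    idx = np.argsort(-(E @ z))[:k]  # cosine similarity ranking
    return [vocab[i] for i in idx]

tokens = top_k_tokens(rng.standard_normal(d))
prompt = "Tokens: " + ", ".join(tokens) + ". A short caption:"
```

The language model then completes this prompt to produce the caption; no gradient ever flows through it.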

The research also evaluated different decoding strategies and vision backbones on the MS-COCO dataset. The largest ResNet variant achieved better captioning performance than the vision-transformer-based architecture, and only greedy decoding yielded valid, meaningful captions.

Interestingly, the team found that inducing variation by permuting tokens in the prompt led to a substantial improvement. This could be likened to rearranging words in a sentence to create more meaningful or accurate translations. They also noted that the mapping trained via lexical matching performed significantly worse than the mapping trained via an external dataset, highlighting the importance of diverse and rich training data.
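The permutation trick can be sketched as generating several prompts with distinct token orders and letting the language model caption each one (the helper name and prompt template here are hypothetical):

```python
import random

def permuted_prompts(tokens, n=5, seed=0):
    """Build n prompts with distinct token orders.

    Shuffling the retrieved tokens induces variation in the generated
    captions; n must not exceed the number of possible permutations.
    """
    rng = random.Random(seed)
    seen, prompts = set(), []
    while len(prompts) < n:
        perm = tuple(rng.sample(tokens, len(tokens)))
        if perm not in seen:
            seen.add(perm)
            prompts.append("Tokens: " + ", ".join(perm) + ". Caption:")
    return prompts

prompts = permuted_prompts(["dog", "ball", "park", "grass"])
print(len(prompts))  # 5
```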

The research also evaluated SITTA on the Flickr30k dataset, where it again outperformed other methods. Notably, SITTA outperformed MAGIC, Wang et al., and even CapDec on MS-COCO, despite CapDec carrying approximately 193 times more trainable parameters. This is akin to a lightweight, efficient translator outperforming a large, resource-intensive one.

The team also found that the orthogonal constraint on the mapping facilitated transfer across datasets, similar to how understanding the structure of a language can help in learning other languages. They also found that SITTA could work even for models of comparatively low complexity, making image captioning more accessible for users with limited resources.

The research paper concludes by noting that SITTA's semantic mapping comprises only approximately 4M parameters, and training takes just a few minutes on a CPU. This makes it a powerful tool for image captioning, outperforming related methods that are trained end-to-end or fine-tune the language model and require far more compute. The future aim is to adapt the method to multiple downstream tasks and to compute the mappings on data from different tasks, endowing the method with more sophisticated visual reasoning capabilities. The authors advocate for open research and reproducibility, making all their code and pretrained mappings publicly available.
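The ~4M figure is consistent with a single linear map between typical embedding sizes. Assuming a 1024-dimensional vision embedding and Llama-7B's 4096-dimensional token space (dimensions chosen for illustration, not taken from the paper's exact configuration):

```python
# Parameter count of a linear map from vision space to language space.
d_vision, d_lm = 1024, 4096  # assumed dims: CLIP ResNet embedding, Llama-7B hidden size
params = d_vision * d_lm
print(params)  # 4194304, i.e. roughly 4M
```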

In conclusion, the research introduces an efficient, accessible, and effective method to semantically map between the embedding spaces of a pretrained vision encoder and a generative language model. By doing so, it opens up the possibility of generating high-quality image captions even with limited computational resources.