
Notes on Generative Pretraining in Multimodality

This is a summary of an important research paper that offers a 24:1 time savings over reading the original. It was made interactively by a human and several AIs. The goal is to save time and curate good ideas.


Link to paper: https://arxiv.org/abs/2307.05222

Paper published on: 2023-07-11

Paper's authors: Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

GPT3 API Cost: $0.05

GPT4 API Cost: $0.13

Total Cost To Write This: $0.18

Time Savings: 24:1

An Introduction to Emu: A Transformer-based Multimodal Foundation Model

In the realm of artificial intelligence, a recent research paper has introduced Emu, a multimodal foundation model based on the Transformer architecture. This model is capable of processing both visual and textual data in a unified manner, leading to advancements in tasks such as image-to-text and text-to-image conversion, image captioning, visual question answering, and video question answering. In simple terms, Emu is a model that can understand and generate both images and text, making it a powerful tool for multimodal tasks.

For instance, consider a scenario where a user inputs a sentence like "A cat sitting on a red mat". Emu can generate a realistic image matching that description, illustrating its text-to-image capabilities. Conversely, given an image of a dog playing with a ball, Emu can generate a descriptive sentence like "A dog is playing with a ball", demonstrating its image-to-text capabilities.

Emu's Architecture and Training

Emu's architecture consists of a Visual Encoder, a Causal Transformer, a Multimodal Modeling Large Language Model (LLM), and a Visual Decoder.

The Visual Encoder's role is to convert visual signals into embeddings, which are compact representations that capture the essential features of the visual data. The Causal Transformer then transforms these visual embeddings into a fixed number of visual causal embeddings. The Causal Transformer is designed to capture the causal dependency of the given image, transforming 2D spatial visual signals into 1D causal sequences in a latent space.
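As a rough illustration of that 2D-to-1D step, the sketch below uses a small transformer decoder in which a fixed number of learned query tokens cross-attend to the encoder's patch features while attending causally to each other. This is a minimal stand-in, not the paper's exact module; the class name, sizes, and layer counts are all assumptions for the example.

```python
import torch
import torch.nn as nn

class CausalResampler(nn.Module):
    """Illustrative sketch (not the paper's exact architecture): N learned
    query tokens cross-attend to the 2D patch features from the visual
    encoder while attending causally to one another, yielding a fixed
    number of 1D causal visual embeddings."""

    def __init__(self, dim=1024, n_queries=64, n_heads=8, n_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        # Upper-triangular mask so query i only attends to queries 0..i.
        mask = torch.triu(
            torch.full((n_queries, n_queries), float("-inf")), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, patch_feats):              # (B, num_patches, dim)
        b = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        return self.decoder(q, patch_feats, tgt_mask=self.causal_mask)

feats = torch.randn(2, 256, 1024)                # e.g. 16x16 ViT patches
out = CausalResampler()(feats)
print(out.shape)                                 # torch.Size([2, 64, 1024])
```

Whatever the image resolution, the output is always a fixed-length causal sequence, which is what lets images slot into an autoregressive token stream.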

The Multimodal Modeling LLM performs autoregressive modeling of multimodal sequences, which means it predicts the next element in a sequence based on the previous elements. The Visual Decoder then decodes the visual embeddings back into images. In essence, Emu can generate realistic images from regressed visual embeddings using a fine-tuned latent diffusion model.
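A toy sketch of that unified decoding loop is shown below. All names and shapes here are illustrative assumptions, not the paper's interface: at each step the model emits either a text token or a regressed visual embedding, and emitted embeddings would later be rendered into pixels by the diffusion-based Visual Decoder.

```python
def generate(model, prompt, max_steps=8):
    """Autoregressive loop over a mixed text/visual sequence."""
    seq = list(prompt)
    for _ in range(max_steps):
        element = model(seq)           # ("text", token) or ("vis", embedding)
        seq.append(element)
        if element == ("text", "<eos>"):
            break
    return seq

def toy_model(seq):
    """Stand-in model: emits two 4-d 'visual embeddings', then stops."""
    n_generated = len(seq) - 2         # the prompt below has two elements
    if n_generated < 2:
        return ("vis", (0.0, 0.1, 0.2, 0.3))
    return ("text", "<eos>")

out = generate(toy_model, [("text", "a"), ("text", "cat")])
print(out)
```

The point of the sketch is that text and images share one sequence and one prediction loop; only the type of the predicted element changes.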

Emu's training objective is to maximize the likelihood of the web-scale corpora in an auto-regressive manner. This means it learns to predict the next element in a sequence, whether it's a text token or a visual embedding. Two types of losses are used in this process: cross-entropy loss for text tokens and ℓ2 regression loss for visual embeddings.
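The two-part objective can be sketched as follows. The shapes and the equal weighting of the two terms are assumptions for illustration, not the paper's exact recipe: text positions get a next-token cross-entropy loss over the vocabulary, while visual positions get an ℓ2 (mean-squared-error) regression loss against the target embedding.

```python
import torch
import torch.nn.functional as F

def multimodal_loss(text_logits, text_targets, vis_preds, vis_targets):
    """Illustrative combined objective: cross-entropy for text tokens,
    L2 regression for visual embeddings (weighting is an assumption)."""
    ce = F.cross_entropy(text_logits, text_targets)  # next-token prediction
    l2 = F.mse_loss(vis_preds, vis_targets)          # regress next embedding
    return ce + l2

text_logits = torch.randn(10, 32000)    # 10 text positions, 32k vocab
text_targets = torch.randint(0, 32000, (10,))
vis_preds = torch.randn(64, 1024)       # 64 visual positions, dim 1024
vis_targets = torch.randn(64, 1024)
loss = multimodal_loss(text_logits, text_targets, vis_preds, vis_targets)
print(loss.item())
```

Because both losses are computed over positions of the same autoregressive sequence, a single backward pass trains the model on text and images jointly.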

Emu's Data Sources and Initialization

Emu can explore diverse pretraining data sources at scale, such as videos with interleaved frames and text, webpages with interleaved images and text, and web-scale image-text pairs and video-text pairs. This means it can learn from a wide range of data, including images and text that appear together in the same context, which is often the case in real-world data.
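To make "interleaved" concrete, here is a minimal sketch of how such a document might be flattened into one training sequence. The `[IMG]`/`[/IMG]` markers, the per-image slot count, and the whitespace tokenizer are all assumptions for the example, not the paper's actual tokens.

```python
N_VISUAL = 4  # slots reserved per image for its causal embeddings (toy value)

def build_sequence(segments):
    """segments: list of ("text", str) or ("image", image_id) items,
    flattened into a single interleaved token sequence."""
    seq = []
    for kind, payload in segments:
        if kind == "text":
            seq.extend(payload.split())      # stand-in for real tokenization
        else:
            seq.append("[IMG]")              # image span opens
            seq.extend(f"<vis:{payload}:{i}>" for i in range(N_VISUAL))
            seq.append("[/IMG]")             # image span closes
    return seq

doc = [("text", "A dog playing with a ball"), ("image", 0),
       ("text", "caught mid-air")]
print(build_sequence(doc))
```

Videos with interleaved frames and text would flatten the same way, with each frame occupying one image span in the sequence.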

Emu is initialized with EVA-02-CLIP and LLaMA models. EVA-02-CLIP is a model that understands images and text in a unified embedding space, while LLaMA is a large language model that has been trained on a vast amount of text data. The Visual Decoder is then fine-tuned with image-text pair datasets, which helps it to generate realistic images from textual descriptions.

Emu's Performance and Applications

Emu outperforms state-of-the-art large multimodal models on a range of zero-shot and few-shot tasks. Zero-shot tasks are those where the model is tested on a task it has not seen during training, while few-shot tasks are those where the model is given a few examples of a new task before being tested on it.

Emu can be used as a multimodal assistant via instruction tuning. This means it can be fine-tuned to follow human instructions, making it a useful tool for tasks that require understanding and generating both images and text.

Emu can perform various types of completion in a multimodal sequence. For example, given a series of images and text, it can predict what comes next, whether it's an image or a piece of text.

Conclusion

Emu represents a significant advancement in multimodal AI models. Its ability to understand and generate both images and text, combined with its impressive performance on various tasks, makes it a powerful tool for many applications. Whether it's generating images from textual descriptions, answering questions about images or videos, or acting as a multimodal assistant, Emu is pushing the boundaries of what's possible in the field of AI.