
Notes on InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

This is a summary of a research paper, made interactively by a human and several AIs. The goal is to curate good ideas and save time.

3 min read

Link to paper: https://arxiv.org/abs/2307.06942

Paper published on: 2023-07-13

Paper's authors: Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinyuan Chen, Yaohui Wang, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao

Welcome to our exploration of InternVid, a large-scale video-centric multimodal dataset, and ViCLIP, a video-text representation learning model based on ViT-L. To help you understand these concepts better, think of InternVid as a vast library of videos, each accompanied by a detailed book that describes its content. On the other hand, ViCLIP is like a librarian who has read all these books and watched all these videos, and can now make smart connections between the two.

InternVid is a treasure trove of over 7 million videos clocking nearly 760K hours, with 234M video clips and 4.1B words of detailed descriptions. The dataset is created using a multi-scale approach to generate video-related descriptions, ensuring high-quality video-text data with minimal human intervention. Imagine having a microscope that allows you to zoom in for finer details and zoom out for a broader perspective. That's exactly what the multi-scale approach does for video descriptions.
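To make the multi-scale idea concrete, here is a minimal sketch of a two-scale captioning pipeline: a coarse caption of a single representative frame, plus a fine-grained description fused from per-frame captions. The functions `caption_image` and `summarize` are hypothetical stand-ins for an image-captioning model and a language model, not the paper's actual components.

```python
def caption_image(frame: str) -> str:
    # Placeholder: a real pipeline would run an image-captioning model here.
    return f"a scene of {frame}"

def summarize(captions: list) -> str:
    # Placeholder: a real pipeline would fuse per-frame captions
    # with a language model rather than simple concatenation.
    return "; ".join(captions)

def describe_clip(frames: list) -> dict:
    # Coarse scale: caption one representative (middle) frame.
    coarse = caption_image(frames[len(frames) // 2])
    # Fine scale: caption every frame, then fuse into one description.
    fine = summarize([caption_image(f) for f in frames])
    return {"coarse": coarse, "fine": fine}

clip = ["a dog", "a dog jumping", "a dog catching a frisbee"]
description = describe_clip(clip)
```

The point of the two scales is complementary coverage: the coarse caption gives a cheap global gist, while the fused per-frame captions recover temporal detail.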

ViCLIP, the video-text representation learning model, is trained on InternVid via contrastive learning. Think of contrastive learning as a process of understanding what makes a video and its description similar, and what makes them different from other videos and descriptions. This model achieves leading zero-shot action recognition and competitive video retrieval performance.

The video encoder of ViCLIP uses a standard ViT with spatiotemporal attention and applies random patch masking to the input videos. Imagine watching a video while focusing on different parts of the screen at different times (spatiotemporal attention) and sometimes putting random stickers on the screen (random patch masking). The text encoder, on the other hand, is a transformer.
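The "random stickers" idea can be sketched in a few lines. Below is a minimal NumPy illustration of random spatial patch masking on a single-channel video tensor; ViCLIP's actual masking details (ratio, patch size, per-frame versus tube masking) may differ, and the parameter values here are illustrative assumptions.

```python
import numpy as np

def mask_random_patches(video, patch=4, mask_ratio=0.5, seed=0):
    """Zero out a random subset of spatial patches in every frame.

    video: array of shape (T, H, W), with H and W divisible by `patch`.
    A sketch of the idea only, not ViCLIP's exact implementation.
    """
    rng = np.random.default_rng(seed)
    t, h, w = video.shape
    grid_h, grid_w = h // patch, w // patch
    n_patches = grid_h * grid_w
    n_mask = int(n_patches * mask_ratio)
    masked = video.copy()
    # Pick which patches to hide, then zero them in every frame.
    for i in rng.choice(n_patches, size=n_mask, replace=False):
        r, c = divmod(i, grid_w)
        masked[:, r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
    return masked

vid = np.ones((8, 16, 16))        # 8 frames of 16x16 "pixels"
out = mask_random_patches(vid)    # half the patches are now zeroed
```

Masking a large fraction of patches cuts the compute of the video encoder substantially while forcing it to reason from partial evidence, which is why masked video modeling scales well.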

The framework optimizes video-text alignment using the InfoNCE loss and cosine similarity between video and text features. Consider InfoNCE loss as a measure of how well our librarian (ViCLIP) is doing at connecting the right video with the right book. Cosine similarity is like a compass that helps the model understand the direction of the relationship between the video and text features.
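The loss can be sketched concretely. Below is a minimal NumPy implementation of a symmetric InfoNCE over a batch of paired video/text embeddings, where row i of each matrix is the matched pair and all other rows act as negatives. The temperature of 0.07 is a common CLIP-style default, an assumption here rather than ViCLIP's documented value.

```python
import numpy as np

def info_nce(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss over paired video/text embeddings."""
    # L2-normalize so dot products become cosine similarities.
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = v @ t.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(v))              # matched pairs on the diagonal

    def xent(l):
        # Numerically stable cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # Average the video-to-text and text-to-video directions.
    return (xent(logits) + xent(logits.T)) / 2
```

When the matched embeddings line up (diagonal similarities near 1, off-diagonal near 0), the loss approaches zero; shuffling the text rows against the videos drives it up, which is exactly the signal that pulls matched pairs together and pushes mismatched ones apart.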

The performance of ViCLIP is evaluated on popular video benchmarks covering zero-shot action recognition and fine-tuned video retrieval. ViCLIP achieves leading zero-shot action recognition results and competitive retrieval performance, demonstrating the effectiveness of video-text representation learning at scale.

One of the key takeaways from this research is that data quality is more critical than data scale in representation learning. In other words, it's not just about how many videos and books our librarian reads, but also about the quality of these materials. Another important finding is that video-language pretraining is crucial for enhancing fine-tuned and zero-shot retrieval performance.

The research also introduces a video generation baseline using a U-Net with a transformer for text-to-video generation. This is akin to having a movie director who can create a video based on a script (text). The quality of the synthesized videos is evaluated with metrics such as frame-wise FID, FVD (Fréchet Video Distance), Inception Score, and CLIP similarity.

The word distribution of generated captions in InternVid includes objects, attributes, locations, scenes, actions/events, and more. This is similar to the variety of vocabulary that a novelist would use to write a captivating story. The research also includes word distributions of ASR transcripts in different languages, reflecting the multilingual nature of the dataset.

In conclusion, the paper introduces InternVid, a large-scale video-centric multimodal dataset, and ViCLIP, a video-text representation learning model, with broad applications in multimodal video understanding and generation. It highlights the importance of data quality over data scale in representation learning and the effectiveness of video-language pretraining, and it points to the potential of improved video captioning and the utility of video captions in downstream understanding and generation tasks.