
Notes on Test-Time Training on Video Streams

This is a summary of an important research paper that provides a 21:1 time savings. It was made interactively by a human and several AIs. The goal is to save time and curate good ideas.


Link to paper: https://arxiv.org/abs/2307.05014

Paper published on: 2023-07-12

Paper's authors: Renhao Wang, Yu Sun, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang

GPT-3 API Cost: $0.05

GPT-4 API Cost: $0.15

Total Cost To Write This: $0.20

Time Savings: 21:1

Test-Time Training (TTT) on Video Streams

This tutorial explores Test-Time Training (TTT), an approach to improving prediction accuracy on unlabeled test data, as presented in the research paper. The method is particularly effective on video streams: the model is trained on each test instance with a self-supervised task before making a prediction on it. The technique is extended to a streaming setting, where multiple test instances (video frames) arrive in temporal order; this extension is termed online TTT.

Online Test-Time Training (TTT)

Online TTT initializes the current model from the previous one and trains it on the current frame together with a small window of frames immediately before it. This procedure significantly outperforms the fixed-model baseline on four tasks across three real-world datasets. It also outperforms its offline variant, which accesses strictly more information by training on all frames of the entire test video regardless of temporal order. The authors attribute the advantage of online over offline TTT to a concept they call "locality".
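The online loop can be sketched as follows. This is a toy illustration with a one-parameter linear model and a synthetic reconstruction loss (both assumptions made for clarity, not the paper's MAE architecture); it shows only the control flow: parameters carry over from frame to frame, and each new frame triggers a few gradient steps on a small sliding window before the prediction is made.

```python
import numpy as np

def self_supervised_loss_grad(w, frame):
    # Toy stand-in for the real self-supervised objective:
    # loss = mean((w * frame - frame)^2), minimized at w = 1.
    residual = w * frame - frame
    return float(np.mean(residual ** 2)), 2.0 * float(np.mean(residual * frame))

def online_ttt(frames, window_size=3, lr=0.1, steps_per_frame=5):
    w = 0.0  # "pre-trained" parameter; the real model starts from MAE weights
    predictions = []
    for t in range(len(frames)):
        # Train only on the current frame and a small trailing window.
        window = frames[max(0, t - window_size + 1): t + 1]
        for _ in range(steps_per_frame):
            for frame in window:
                _, g = self_supervised_loss_grad(w, frame)
                w -= lr * g          # parameters carry over to the next frame
        predictions.append(w * frames[t])  # predict with the adapted model
    return w, predictions

rng = np.random.default_rng(0)
frames = [rng.normal(size=16) for _ in range(10)]
w_final, preds = online_ttt(frames)
# After adaptation, w approaches 1.0, the minimizer of the toy loss.
```

The key structural point is that the model is never reset between frames, so adaptation accumulates as the stream progresses.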

The Concept of Locality

Locality explains why online TTT is more effective than offline TTT. Nearby frames in a video are more likely to contain similar information than frames far apart, so a model trained on a small window of frames immediately preceding the current frame tends to outperform one trained on all frames regardless of temporal order. The authors support this with ablation studies and a theory based on the bias-variance trade-off.
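The bias-variance intuition behind locality can be illustrated with a toy drifting signal standing in for a slowly changing video (this is an illustration, not the paper's formal theory): a trailing-window mean has higher variance but far lower bias under drift than a mean over all past observations.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 200, 5
drift = 0.1 * np.arange(T)                    # slowly changing ground truth
obs = drift + rng.normal(scale=0.5, size=T)   # noisy "frames"

win_err, all_err = [], []
for t in range(k, T):
    win_est = obs[t - k: t].mean()   # small local window: low bias under drift
    all_est = obs[:t].mean()         # all past frames: biased toward stale data
    win_err.append((win_est - drift[t]) ** 2)
    all_err.append((all_est - drift[t]) ** 2)

# The local estimate tracks the drifting signal far better on average.
```

The local window pays a small variance cost (fewer samples) but avoids the large bias of averaging over a signal that has since moved on, mirroring the trade-off the paper analyzes.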

The Role of Masked Autoencoders (MAE) in TTT

Masked autoencoders (MAE) provide the self-supervised task in TTT-MAE: masked image reconstruction, in which the model learns to reconstruct masked patches of an image. The main task can be any prediction task, such as segmentation, and is trained using the original, unmasked patches of the image. The empirical success of TTT-MAE inspired the authors to make it the inner loop of online TTT and to extend it to other main tasks such as segmentation.
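The patching-and-masking mechanics can be sketched as below. This shows only the data preparation for masked reconstruction (the patch size, mask ratio, and a toy 16×16 image are illustrative choices); the actual ViT encoder and decoder are omitted.

```python
import numpy as np

def patchify(image, patch=4):
    # Split an (H, W) image into non-overlapping (patch x patch) tiles,
    # flattened to rows of length patch*patch.
    h, w = image.shape
    tiles = image.reshape(h // patch, patch, w // patch, patch)
    return tiles.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def mask_patches(patches, mask_ratio=0.75, seed=0):
    # Randomly hide a fraction of patches; the encoder sees only the rest.
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    masked_idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    visible_idx = np.setdiff1d(np.arange(n), masked_idx)
    return visible_idx, masked_idx

image = np.arange(16 * 16, dtype=float).reshape(16, 16)
patches = patchify(image)                 # 16 patches of 16 pixels each
visible, masked = mask_patches(patches)   # encoder input: patches[visible]
recon_target = patches[masked]            # loss is computed on masked patches
```

The reconstruction loss is computed only on the masked patches, which is what makes the task self-supervised: the targets come from the image itself.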

Test-Time Training with Masked Autoencoders (TTT-MAE)

TTT-MAE trains the model at test time on a self-supervised task alongside the main task. Applied to video streams, the model makes a prediction on each frame as it arrives. To prepare the model components, the authors use joint training: the self-supervised and main-task losses are optimized together during training. The results show that TTT-MAE improves main-task performance compared to the fixed-model baseline.
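The joint-training objective can be sketched with a toy shared parameter (a stand-in for the shared encoder; the tasks, the loss weight `lam`, and the scalar model are all assumptions for illustration): both losses are computed through the same weights and optimized together.

```python
import numpy as np

def joint_step(w, x, y, lam=0.5, lr=0.1):
    # Toy stand-ins: main task regresses y from w * x; self-supervised
    # task "reconstructs" x from w * x. Both share w, mimicking a shared
    # encoder, and are optimized jointly with weight `lam` (assumed).
    main_res = w * x - y
    ssl_res = w * x - x
    loss = float(np.mean(main_res ** 2) + lam * np.mean(ssl_res ** 2))
    grad = 2 * np.mean(main_res * x) + lam * 2 * np.mean(ssl_res * x)
    return w - lr * grad, loss

rng = np.random.default_rng(0)
x = rng.normal(size=64)
y = 2.0 * x                       # main-task target
w, losses = 0.0, []
for _ in range(200):
    w, loss = joint_step(w, x, y)
    losses.append(loss)
# w settles between the two tasks' optima (2.0 for main, 1.0 for SSL),
# at (2 + lam) / (1 + lam) for this toy objective.
```

Because the two losses share parameters, gradients from the self-supervised task shape the same representation the main task uses, which is what makes test-time training on the self-supervised task alone transfer to the main task.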

TTT-MAE for Video Segmentation

The research focuses on test-time training for video segmentation. The authors propose a method that uses both implicit and explicit memory to improve segmentation performance. The method is evaluated on multiple datasets, including KITTI-STEP and COCO Videos. The results show that the proposed method outperforms baseline techniques and achieves state-of-the-art performance on KITTI-STEP. The method is computationally efficient and runs faster than previous approaches.

Experimental Setup and Results

The experiments were conducted on COCO Videos, a new dataset the authors collected and annotated, which contains longer and more challenging videos than other public datasets. The relative improvements on COCO Videos are 45% for instance segmentation and 66% for panoptic segmentation. The benchmark metrics are average precision (AP) for instance segmentation and panoptic quality (PQ) for panoptic segmentation.

The Mask2Former model, pre-trained on still images in the COCO training set, was used as a starting point for training on the videos. The performance of the Mask2Former model dropped significantly when evaluated on the videos, highlighting the challenging nature of the dataset. The online TTT-MAE method, with a small window size, improved instance and panoptic segmentation performance by 45% and 66% respectively compared to the baseline method.

The Importance of Window Size in TTT-MAE

The window size for TTT-MAE is an important hyperparameter, with too little or too much memory hurting performance. Shuffling the frames within each video, destroying temporal smoothness, significantly reduces the performance of the online TTT-MAE method. Theoretical analysis shows that the choice of window size for TTT can have a significant impact on performance.
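The shuffling ablation has a toy analogue. With a slowly drifting signal standing in for a temporally smooth video (an illustration, not the paper's experiment), a trailing-window estimate is accurate only while temporal order is intact; permuting the frames destroys the smoothness the window relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 200, 5
drift = 0.1 * np.arange(T)                    # slowly changing ground truth
obs = drift + rng.normal(scale=0.5, size=T)   # noisy "frames"

def window_mse(signal, target, k):
    # Squared error of a trailing-window mean as a predictor of the target.
    errs = [(signal[t - k: t].mean() - target[t]) ** 2
            for t in range(k, len(signal))]
    return float(np.mean(errs))

perm = rng.permutation(T)                     # destroy temporal smoothness
ordered_err = window_mse(obs, drift, k)
shuffled_err = window_mse(obs[perm], drift[perm], k)
# Ordered: recent frames resemble the current one, so the window tracks it.
# Shuffled: the window averages unrelated frames and the advantage vanishes.
```

This mirrors the paper's finding that online TTT's gains come from temporal smoothness, not merely from seeing more data.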

Conclusion

In conclusion, the paper introduces online Test-Time Training (TTT), a method for improving prediction accuracy on unlabeled video streams by continuing to train on a sliding window of past frames at test time. The paper presents theoretical analysis and empirical results supporting the method's effectiveness. The authors also acknowledge support from Oracle Cloud and other funding sources.