Notes on CoTracker: It is Better to Track Together

This is a summary of an important research paper that provides a 35:1 time savings. It was crafted by humans working with several AIs. The goal is to save time and curate good ideas.

Link to paper: https://arxiv.org/abs/2307.07635

Paper published on: 2023-07-14

Paper's authors: Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, Christian Rupprecht

GPT3 API Cost: $0.07

GPT4 API Cost: $0.11

Total Cost To Write This: $0.17

Time Savings: 35:1

The ELI5 TLDR:

CoTracker is a new way to track multiple points in a video. It uses a transformer network to model how points relate to each other across frames, which lets it track each point throughout the video even as new points enter the scene or old ones leave. CoTracker handles long videos with a sliding-window approach, focusing on a subset of frames at a time, and refines its tracks with an iterative algorithm that considers correlations between points at multiple scales. It extracts features with a CNN, is trained on synthetic data with iterative updates, and outperforms prior methods on several benchmark datasets. CoTracker has some limitations, but the paper outlines ideas for future improvements. Overall, it could have a big impact on applications like autonomous vehicles, video editing, and surveillance.

The Deeper Dive:

Summary: CoTracker - A Novel Approach to Video Tracking

The paper introduces a new architecture known as CoTracker, designed for tracking multiple points in a video. This novel architecture combines ideas from the optical flow and tracking literature, using a transformer network to model how different points correlate over time.

To imagine how this works, consider a video of a flock of birds. Traditional tracking methods might struggle to follow each bird individually, especially as they move and overlap. CoTracker, however, can track each bird (or point) throughout the video, even as new birds enter the scene or others leave. It's a sliding-window approach, meaning it can handle long videos by focusing on a subset of frames at a time.

Understanding CoTracker's Architecture

CoTracker is based on a transformer network, a type of model that uses attention mechanisms to focus on different parts of the input data. In CoTracker, these attention layers are specialized to model the correlation of different points in time. This means that the network can understand how points relate to each other across different frames in the video.

The architecture also introduces the concept of input and output tokens. Input tokens code for position, visibility, appearance, and correlation of the tracks, while output tokens contain the updated locations and appearance features. This token system allows the network to keep track of the state of each point at each step.
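To make the token idea concrete, here is a minimal sketch of how per-point, per-frame state might be packed into input tokens. The function name, dimensions, and ordering are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def build_input_tokens(positions, visibility, appearance, correlation):
    """Concatenate per-point, per-frame track state into transformer input tokens.

    positions:   (T, N, 2)  estimated (x, y) of each point in each frame
    visibility:  (T, N, 1)  visibility flag per point per frame
    appearance:  (T, N, D)  appearance features
    correlation: (T, N, C)  correlation features
    Returns tokens of shape (T, N, 2 + 1 + D + C).
    """
    return np.concatenate([positions, visibility, appearance, correlation], axis=-1)

# Illustrative sizes: 8 frames, 4 points, 128-dim appearance, 16-dim correlation
T, N, D, C = 8, 4, 128, 16
tokens = build_input_tokens(
    np.zeros((T, N, 2)), np.ones((T, N, 1)),
    np.zeros((T, N, D)), np.zeros((T, N, C)),
)
print(tokens.shape)  # (8, 4, 147)
```

The output tokens would carry the same layout, with the position and appearance slots holding the updated estimates after each transformer pass.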

The CoTracker Algorithm

CoTracker uses an iterative algorithm to track points. It starts by initializing a set of points to track, represented by P. These points are then tracked in a sliding window of video frames, using a set of tracking features, Q. The algorithm then updates the estimated tracks and starting locations (P) and the tracking features (Q) in an iterative manner, using a copy operation to ensure consistency.
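The iterative refinement loop can be sketched as follows. The `update_fn` stands in for a full transformer pass; here it is a toy function, and the residual-update structure is the point being illustrated, not the paper's exact math:

```python
import numpy as np

def iterative_track(P, Q, num_updates, update_fn):
    """Refine track positions P and tracking features Q over several steps.

    P: (T, N, 2) current position estimates for N points across T frames
    Q: (N, D)    per-point tracking features
    update_fn(P, Q) -> (dP, dQ): one network pass returning residual updates
    """
    for _ in range(num_updates):
        dP, dQ = update_fn(P, Q)
        P = P + dP  # move position estimates toward the true tracks
        Q = Q + dQ  # refine the tracking features
    return P, Q

# Toy update that moves estimates halfway toward a fixed target each step
target = np.ones((5, 3, 2))
update = lambda P, Q: (0.5 * (target - P), np.zeros_like(Q))
P, Q = iterative_track(np.zeros((5, 3, 2)), np.zeros((3, 8)), 4, update)
print(P[0, 0])  # [0.9375 0.9375] -- converging toward the target
```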

As part of this process, the algorithm introduces the concept of correlation at S scales. This means that the algorithm considers the relationship between points at different scales or levels of detail, which can help improve the accuracy of the tracking.
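A multi-scale correlation can be sketched as a small pyramid: correlate the point's feature with the frame's feature map, then repeat on progressively pooled (coarser) versions of the map. This is a simplified stand-in for the paper's correlation volumes, with assumed shapes and average pooling:

```python
import numpy as np

def correlation_pyramid(feat_map, query_feat, scales=4):
    """Correlate a tracked point's feature with a frame's feature map at
    several spatial scales; each scale halves the resolution via 2x2 pooling.

    feat_map:   (H, W, D) dense frame features
    query_feat: (D,)      feature of the tracked point
    Returns a list of correlation maps [(H, W), (H/2, W/2), ...].
    """
    maps = []
    f = feat_map
    for _ in range(scales):
        maps.append(f @ query_feat)  # dot-product correlation at this scale
        H, W, D = f.shape
        f = f[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2, D).mean(axis=(1, 3))
    return maps

pyr = correlation_pyramid(np.random.rand(16, 16, 8), np.random.rand(8))
print([m.shape for m in pyr])  # [(16, 16), (8, 8), (4, 4), (2, 2)]
```

Coarse scales help localize a point after large motions; fine scales sharpen the final position estimate.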

Training and Evaluating CoTracker

The CoTracker model is trained using synthetic data and evaluated on several benchmark datasets. The training process involves an unrolled inference approach, which allows the model to handle semi-overlapping windows. The primary loss function is for track regression, with a secondary loss for visibility flags.
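The two losses can be sketched as a regression term on track coordinates plus a binary cross-entropy term on visibility. The L1 choice and the absence of per-iteration weighting here are simplifying assumptions; the paper's exact loss formulation differs in its details:

```python
import numpy as np

def cotracker_losses(pred_tracks, gt_tracks, pred_vis_logits, gt_vis):
    """Primary track-regression loss plus a secondary visibility loss.

    pred_tracks, gt_tracks:  (T, N, 2) predicted / ground-truth positions
    pred_vis_logits, gt_vis: (T, N)    visibility logits / binary labels
    """
    track_loss = np.abs(pred_tracks - gt_tracks).mean()  # L1 regression
    p = 1.0 / (1.0 + np.exp(-pred_vis_logits))           # sigmoid
    vis_loss = -(gt_vis * np.log(p + 1e-8)
                 + (1 - gt_vis) * np.log(1 - p + 1e-8)).mean()  # BCE
    return track_loss, vis_loss

gt = np.random.rand(8, 4, 2)
vis = np.ones((8, 4))
t_loss, v_loss = cotracker_losses(gt, gt, 10.0 * np.ones((8, 4)), vis)
print(t_loss)  # 0.0 -- perfect track predictions
```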

Evaluation metrics include position accuracy, occlusion accuracy, and average Jaccard. Across almost all benchmarks, CoTracker outperforms existing methods. The evaluation protocol also ensures fairness by testing different distributions of tracked points.
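As an illustration of the position-accuracy metric, here is a TAP-Vid-style sketch: the fraction of visible points predicted within a pixel threshold, averaged over several thresholds. The threshold values are the conventional ones from that protocol and are an assumption here:

```python
import numpy as np

def position_accuracy(pred, gt, vis, thresholds=(1, 2, 4, 8, 16)):
    """Fraction of visible points within a pixel threshold of ground truth,
    averaged over several thresholds (TAP-Vid-style delta metric).

    pred, gt: (T, N, 2) positions; vis: (T, N) binary visibility mask
    """
    err = np.linalg.norm(pred - gt, axis=-1)
    err = err[vis.astype(bool)]  # score only visible points
    return float(np.mean([(err < t).mean() for t in thresholds]))

gt = np.zeros((4, 2, 2))
pred = gt + np.array([3.0, 0.0])  # every point off by 3 pixels
acc = position_accuracy(pred, gt, np.ones((4, 2)))
print(acc)  # 0.6 -- passes the 4, 8, and 16 pixel thresholds only
```

Occlusion accuracy scores the visibility flags, and Average Jaccard combines position and visibility correctness into a single number.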

Implementing CoTracker

In terms of implementation, CoTracker uses a feature CNN that downsamples the input image by a factor of 8 and outputs features with 128 channels. The model is trained with 4 iterative updates and evaluated with 6 updates, using a batch size of 32 distributed across 32 GPUs. The learning rate is set to 5e-4, and the model is trained for 50,000 iterations.
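The stride-8, 128-channel design fixes the feature-map geometry for any input resolution. A trivial sketch of that relationship (the function is hypothetical, the numbers come from the paper's setup):

```python
def feature_map_shape(h, w, stride=8, channels=128):
    """Shape of the feature map the CNN produces for an (h, w) frame:
    spatial dimensions downsampled by `stride`, with `channels` channels."""
    return (h // stride, w // stride, channels)

print(feature_map_shape(384, 512))  # (48, 64, 128)
```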

During training, data augmentations such as Color Jitter and Gaussian Blur are used to increase the robustness of the model. Sliding windows are used to pass information from one window to the next, with binary masks indicating where predictions are needed.
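The semi-overlapping windows and their binary masks can be sketched like this: frames shared with the previous window carry state forward, while the mask flags the frames where fresh predictions are needed. Window and overlap sizes here are illustrative:

```python
def sliding_windows(num_frames, window=8, overlap=4):
    """Yield (start, end, new_mask) for semi-overlapping windows over a video.

    new_mask flags frames not covered by any previous window -- the frames
    where new predictions are needed; overlapping frames pass state forward.
    """
    step = window - overlap
    covered = 0  # number of frames already predicted by earlier windows
    start = 0
    out = []
    while start < num_frames:
        end = min(start + window, num_frames)
        out.append((start, end, [f >= covered for f in range(start, end)]))
        covered = end
        if end == num_frames:
            break
        start += step
    return out

wins = sliding_windows(12, window=8, overlap=4)
print(wins[1])  # (4, 12, [False]*4 + [True]*4): frames 4-7 carry state over
```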

Limitations and Future Directions

While CoTracker presents a significant advance in video tracking, it does have some limitations. For example, it struggles to track points through occlusions longer than a single window. Additionally, the transformer's complexity is quadratic in the number of tracked points, which makes dense (per-pixel) prediction impractical.

However, the paper suggests several avenues for future work, such as exploring different point selection strategies at inference time, unrolling the sliding window through time during training to improve performance, and experimenting with different feature networks, model strides, and sliding window sizes.

Conclusion: The Potential Impact of CoTracker

CoTracker offers a new approach to video tracking that could have significant implications for a range of applications, from autonomous vehicles to video editing to surveillance. By simultaneously tracking groups of points and learning to account for their correlation, CoTracker can handle complex, dynamic scenes with more accuracy and efficiency than existing methods.

By understanding the workings of CoTracker, you can explore how this approach might be applied to your own work, whether that's developing new video analysis tools, improving existing systems, or exploring new research directions in video tracking.