Greetings,
Welcome to the 10th edition of the State of AI. In this issue, we delve into expansive AI motion tracking capabilities, groundbreaking efficiency improvements in sorting algorithms, nuanced video understanding with Video-ChatGPT, the almost magical advancements in music generation, and fascinating new work on process supervision in neural networks.
Each of these topics offers a glimpse into the diverse applications and continuous advancements in AI, promising an insightful and thought-provoking read. Enjoy!
Best regards,
Contents
Tracking Everything Everywhere All at Once
Faster sorting algorithms discovered using deep reinforcement learning
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
Simple and Controllable Music Generation
Let's Verify Step by Step
Tracking Everything Everywhere All at Once
Authors: Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, Noah Snavely
Source & References: https://arxiv.org/abs/2306.05422
Introduction
A recent machine learning research paper titled "Tracking Everything Everywhere All at Once" by Qianqian Wang et al. proposes a groundbreaking method for estimating full-length motion trajectories for every pixel in every frame of a video. This novel approach, dubbed "OmniMotion", produces accurate, coherent long-range motion even for fast-moving objects and robustly tracks through occlusions. In this summary, we discuss the paper's main ideas and contributions.
The Problem
Existing motion estimation methods, such as optical flow and sparse feature tracking, have limitations in terms of capturing motion trajectories over long temporal windows and maintaining global consistency. These limitations can result in accumulated errors and spatio-temporal inconsistencies in the motion estimates. In addition, previous methods can lose track of objects when they are occluded, making them unable to handle complex dynamic scenes.
OmniMotion Representation
OmniMotion addresses these issues by using a quasi-3D canonical volume and 3D bijections between local and canonical spaces to ensure global consistency, track through occlusions, and model any combination of camera and object motion. The canonical volume acts as a three-dimensional atlas of the observed scene, with each point in the volume associated with a color and density. The 3D bijections are parameterized as invertible neural networks (INNs) that enable expressive mappings between local and canonical spaces.
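The key property of these INN-based bijections is that every mapping from local space to the canonical volume can be inverted exactly, so a point can travel local frame → canonical → any other local frame without loss. As a minimal illustration (not the paper's architecture), here is a numpy sketch of an affine coupling layer, the standard building block for invertible networks; the toy `scale` and `shift` functions stand in for the learned MLPs a real INN would use:

```python
import numpy as np

def coupling_forward(x, scale_fn, shift_fn):
    """Affine coupling layer: invertible by construction.
    Splits a 3D point into (x1, x2); x2 is transformed conditioned on x1."""
    x1, x2 = x[:1], x[1:]
    s, t = scale_fn(x1), shift_fn(x1)
    y2 = x2 * np.exp(s) + t
    return np.concatenate([x1, y2])

def coupling_inverse(y, scale_fn, shift_fn):
    """Exact inverse: recompute s, t from the untouched half and undo the map."""
    y1, y2 = y[:1], y[1:]
    s, t = scale_fn(y1), shift_fn(y1)
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

# Toy conditioners: any functions of x1 work; invertibility never depends on them.
scale = lambda x1: np.tanh(x1)        # stand-in for a small learned MLP
shift = lambda x1: 0.5 * x1

p = np.array([0.3, -1.2, 0.8])        # a 3D point in a local frame
q = coupling_forward(p, scale, shift)  # mapped toward canonical space
p_back = coupling_inverse(q, scale, shift)
assert np.allclose(p, p_back)          # exact round trip, up to float error
```

Stacking several such layers (permuting which coordinates are held fixed) yields an expressive yet exactly invertible mapping, which is what lets OmniMotion chain frame-to-canonical and canonical-to-frame transforms consistently.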
Computing Frame-to-Frame Motion
To compute 2D motion between any two frames, OmniMotion uses a fixed, orthographic camera model to "lift" 3D points from each frame onto corresponding rays, which are then mapped to a target frame using the 3D bijections. The resulting mapped points are aggregated using alpha compositing and projected back to 2D to obtain the predicted corresponding pixel location.
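The pipeline above can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the hypothetical `T_i_inv` and `T_j` callables stand in for the learned bijections into and out of the canonical volume, and `densities` stands in for the learned density field sampled along the ray:

```python
import numpy as np

def map_pixel(p_i, depths, T_i_inv, T_j, densities):
    """Predict where pixel p_i of frame i lands in frame j.
    Under a fixed orthographic camera, a pixel (u, v) lifts to points (u, v, z)."""
    mapped, weights = [], []
    trans = 1.0                                 # transmittance, as in volume rendering
    for z, sigma in zip(depths, densities):
        x_i = np.array([p_i[0], p_i[1], z])     # lift the pixel onto its ray
        x_canonical = T_i_inv(x_i)              # local frame i -> canonical volume
        x_j = T_j(x_canonical)                  # canonical volume -> local frame j
        alpha = 1.0 - np.exp(-sigma)            # opacity of this sample
        mapped.append(x_j)
        weights.append(trans * alpha)           # alpha-compositing weight
        trans *= (1.0 - alpha)
    x_j_hat = sum(w * x for w, x in zip(weights, mapped)) / (sum(weights) + 1e-8)
    return x_j_hat[:2]                          # orthographic projection back to 2D
```

With identity mappings the predicted location reduces to the input pixel, which is a handy sanity check; with learned bijections the compositing lets occluded samples contribute with low weight while the visible surface dominates.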
Optimization Process
OmniMotion's optimization process takes as input a video sequence and a collection of noisy correspondence predictions (e.g., from existing optical flow methods) and generates a complete, globally consistent motion estimate for the entire video. The optimization minimizes the mean absolute error (MAE) between the predicted flow and the supervising input flow, as well as a photometric loss and a regularization term to ensure temporal smoothness.
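Conceptually, the objective combines those three terms. A minimal numpy sketch of such a composite loss follows; the weights and the finite-difference smoothness penalty here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def omnimotion_style_loss(pred_flow, input_flow, pred_color, obs_color,
                          traj, w_pho=1.0, w_reg=0.1):
    """Composite objective: flow MAE + photometric loss + temporal smoothness.
    Weights w_pho, w_reg are placeholders, not the paper's values."""
    # MAE between predicted flow and the noisy supervising flow.
    flow_loss = np.abs(pred_flow - input_flow).mean()
    # Photometric consistency between rendered and observed colors.
    pho_loss = ((pred_color - obs_color) ** 2).mean()
    # Temporal smoothness: penalize acceleration along each trajectory
    # via a second-order finite difference over time.
    accel = traj[2:] - 2 * traj[1:-1] + traj[:-2]
    reg_loss = np.abs(accel).mean()
    return flow_loss + w_pho * pho_loss + w_reg * reg_loss
```

Because the supervising flow is noisy, the optimization effectively reconciles many locally plausible but globally inconsistent pairwise estimates into one coherent motion field.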
Extensive Evaluations
The researchers put OmniMotion to the test on the TAP-Vid benchmark and real-world footage, clearly outperforming prior state-of-the-art methods both quantitatively and qualitatively. This demonstrates that OmniMotion can accurately track points and maintain coherence in space and time for complex, in-the-wild videos. Furthermore, OmniMotion is robust when given different types of input correspondence—such as RAFT optical flow and TAP-Net—achieving consistently high performance.
Limitations and Open Questions
Although OmniMotion shows promise, it has some limitations. Its per-video optimization is computationally expensive, and like other motion estimation methods it can struggle with rapid, highly non-rigid motion. Another limitation stems from the fact that OmniMotion doesn't explicitly disentangle camera and scene motion, so it does not produce physically accurate 3D scene reconstructions. However, refining the approach to address these issues could lead to even more accurate and versatile motion representations.
Implications and Potential Applications
OmniMotion is a significant advancement in the field of motion estimation, offering a globally consistent motion representation that can track points accurately over an entire video even during events like occlusions. This has numerous applications across various domains, including computer vision tasks, video editing, and even in fields like traffic analysis, sports analytics, and surveillance.
Conclusion
In conclusion, the OmniMotion representation presented in the research paper by Wang and colleagues is a major step forward in motion trajectory estimation for videos. This work pushes the boundaries of what can be achieved in terms of dense, long-range motion estimations and offers numerous potential applications in computer vision and related fields. Its success in capturing global consistency, tracking through occlusions, and handling complex dynamic scenes marks it as a pivotal development in the study of motion tracking.
Faster Sorting Algorithms Discovered Using Deep Reinforcement Learning
Authors: Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alex Ahern, Thomas Köppe, Kevin Millikin, Stephen Gaffney, Sophie Elster, Jackson Broshear, Chris Gamble, Kieran Milan, Robert Tung, Minjae Hwang, Taylan Cemgil, Mohammadamin Barekatain, Yujia Li, Amol Mandhane, Thomas Hubert, Julian Schrittwieser, Demis Hassabis, Pushmeet Kohli, Martin Riedmiller, Oriol Vinyals, David Silver
Source & References: https://www.nature.com/articles/s41586-023-06004-9
The Quest for Faster Algorithms
Sorting and hashing algorithms are fundamental components of countless computational processes used daily around the globe. As the demand for computational power continues to rise, finding ways to optimize these algorithms has become increasingly essential. Both human researchers and computational methods have struggled to develop more efficient routines, but in this groundbreaking study, artificial intelligence (AI) demonstrates its capacity to surpass the status quo. The authors introduce AlphaDev, a novel deep reinforcement learning agent designed to discover new and more efficient sorting algorithms autonomously.
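The routines AlphaDev targets are fixed-size sorts for small inputs (e.g., sorting exactly three elements), built from compare-and-swap primitives whose instruction sequences the agent shortens at the assembly level. To make that concrete, here is an illustrative three-element sorting network in Python; AlphaDev's discovered routines operate on x86 assembly (cmp/cmov sequences), not Python, so this sketch only conveys the structure being optimized:

```python
def compare_exchange(a, i, j):
    """Compare-and-swap primitive: put the smaller element first.
    This is the building block whose assembly AlphaDev shortens."""
    if a[i] > a[j]:
        a[i], a[j] = a[j], a[i]

def sort3(a):
    """Fixed-size sorting network for exactly 3 elements.
    The comparator sequence is data-independent, so it maps cleanly
    to a branch-free instruction sequence."""
    compare_exchange(a, 0, 2)
    compare_exchange(a, 0, 1)
    compare_exchange(a, 1, 2)
    return a
```

Because the comparator sequence never depends on the data, shaving even one instruction from it yields a speedup on every invocation, which is why these tiny routines matter at the scale they are called.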