Dear readers,
Welcome to the fourth edition of the State of AI newsletter! We are truly grateful for your continued support as we bring you the most discussed ML/AI research papers of the week, curated entirely by GPT-4. As the field of AI advances at an incredible pace, it becomes nearly impossible to keep up with all the progress. Our goal is to distill the wealth of information into a digestible read that keeps you informed and engaged.
In this week's edition, we explore a diverse range of topics, from extending the context length of BERT with Recurrent Memory Transformer (RMT) to autonomous agents collecting data for Neural Radiance Fields (NeRF). We also delve into learning robust visual features without supervision through DINOv2, examining the limits of differentiable programming techniques, and investigating adversarial patches designed to fool automated surveillance cameras.
As we continue to navigate the ever-evolving AI landscape, we hope our newsletter helps you stay updated on the most recent research and developments. Thank you for joining us on this journey, and we look forward to bringing you even more insightful content. Happy reading!
Best regards,
Contents
Scaling Transformer to 1M tokens and beyond with RMT
DINOv2: Learning Robust Visual Features without Supervision
AutoNeRF: Training Implicit Scene Representations with Autonomous Agents
Gradients are Not All You Need
Fooling automated surveillance cameras: adversarial patches to attack person detection
BONUS - A Cookbook of Self-Supervised Learning
Scaling Transformer to 1M tokens and beyond with RMT
Authors: Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev
Source & References: https://arxiv.org/abs/2304.11062v1
Introducing Recurrent Memory Transformer
Transformers have revolutionized the field of natural language processing with their unparalleled performance in a variety of tasks. However, the quadratic complexity of their attention makes it increasingly difficult to apply them to longer inputs. Enter the Recurrent Memory Transformer (RMT), a game-changing architecture that extends BERT's effective context length up to an astounding two million tokens! The authors of this technical report have not only increased the model's context length but have done so without sacrificing memory retrieval accuracy. The result is an approach that stores and processes both local and global information, making it possible to handle long-term dependencies in text.
A Plug-and-Play Solution
One of the key advantages of the RMT architecture is its plug-and-play nature, making it a perfect wrapper for popular Transformers like BERT. The RMT memory consists of real-valued trainable vectors called memory tokens. These tokens are prepended to the first input segment, and the memory produced at those positions is carried over and prepended to each subsequent segment, so it is processed alongside the input tokens of every segment. This way, the model maintains a fixed memory size while sequentially processing input segments, allowing it to handle much longer inputs than the base model. Importantly, this memory augmentation technique is compatible with any model from the Transformer family, unlocking new possibilities for a wide array of NLP tasks.
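To make the wrapping mechanism concrete, here is a minimal PyTorch sketch of the idea rather than the authors' implementation: the class name, the number of memory tokens, and the Hugging Face-style `inputs_embeds`/`last_hidden_state` interface are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentMemoryWrapper(nn.Module):
    # Illustrative sketch of the RMT idea (not the authors' code): a fixed set of
    # trainable memory tokens is prepended to each segment, and the encoder's
    # output at those positions becomes the memory carried into the next segment.
    def __init__(self, encoder, hidden_size, num_memory_tokens=10):
        super().__init__()
        self.encoder = encoder  # assumed: a BERT-style encoder accepting `inputs_embeds`
        self.num_mem = num_memory_tokens
        self.memory = nn.Parameter(torch.randn(num_memory_tokens, hidden_size) * 0.02)

    def forward(self, segments):
        # segments: list of tensors, each of shape (batch, segment_len, hidden_size)
        mem = self.memory.unsqueeze(0).expand(segments[0].size(0), -1, -1)
        outputs = []
        for seg in segments:
            hidden = self.encoder(inputs_embeds=torch.cat([mem, seg], dim=1)).last_hidden_state
            mem = hidden[:, :self.num_mem]            # updated memory for the next segment
            outputs.append(hidden[:, self.num_mem:])  # regular token outputs for this segment
        return torch.cat(outputs, dim=1), mem
```

Because the base encoder is untouched, the same wrapper pattern could in principle be placed around any Transformer that exposes its hidden states.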
Computational Efficiency
The RMT model offers impressive computational efficiency thanks to its linear scaling of computation. Because the input is processed segment by segment with a fixed-size memory, the quadratic cost of full attention over the whole sequence is replaced by a cost that grows linearly with input length, making it possible to work with sequences of up to one million tokens on a single Nvidia GTX 1080Ti GPU. This means the model can process much longer text inputs without resorting to more memory-heavy architectures. The recurrent memory approach provides not only better performance but also fewer FLOPs than comparable non-recurrent models on long inputs.
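A quick back-of-the-envelope comparison shows why segment-by-segment processing helps; the segment length and memory size below are illustrative choices, not figures from the paper.

```python
def attention_score_count(total_len, seg_len=512, num_mem=10):
    # Count of pairwise attention scores (illustrative only): full attention is
    # quadratic in total length, while segment-by-segment processing with a small
    # memory grows linearly with the number of segments.
    full = total_len ** 2
    n_segments = -(-total_len // seg_len)              # ceil division
    segmented = n_segments * (seg_len + num_mem) ** 2
    return full, segmented

for n in (4_096, 65_536, 1_048_576):
    full, seg = attention_score_count(n)
    print(f"{n:>9} tokens: full {full:.3e} vs. segmented {seg:.3e}")
```

With these illustrative settings, a one-million-token input needs roughly 2,000 times fewer pairwise scores when processed segment by segment than under full attention.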
Testing RMT with Synthetic Tasks
To evaluate the performance of the RMT model in terms of memorization and reasoning abilities, the authors devised synthetic tasks with various difficulty levels. These tasks require the model to memorize simple facts and carry out basic reasoning in the presence of irrelevant text, which adds an extra layer of challenge. The synthetic tasks are split into three categories: Fact Memorization, Fact Detection & Memorization, and Reasoning with Memorized Facts. They push the RMT model to its limits, requiring it to store and process both local and global information.
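As a rough picture of what such a sample looks like, here is a schematic generator; the fact, distractor sentences, and function name are made up for illustration, and the paper's actual generation procedure and text sources differ.

```python
import random

def make_memorization_sample(fact, question, answer, distractors, position="random"):
    # Schematic construction of a "memorize and answer" sample: one relevant fact
    # is hidden inside irrelevant background text, and the question comes last.
    body = list(distractors)
    insert_at = {"start": 0,
                 "end": len(body),
                 "random": random.randint(0, len(body))}[position]
    body.insert(insert_at, fact)
    return {"input": " ".join(body + [question]), "target": answer}

sample = make_memorization_sample(
    fact="Daniel went back to the hallway.",
    question="Where is Daniel?",
    answer="hallway",
    distractors=["The sky was overcast that morning.", "A train passed in the distance."],
)
```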
Curriculum Learning for Better Results
One major finding of this research is that the RMT model benefits from a training schedule, which greatly enhances solution accuracy and stability. In this approach, the model is initially trained on shorter versions of the task, and the task length is increased gradually once the model converges. This process, known as curriculum learning, enables the RMT model to tackle larger tasks more effectively as it has already learned from shorter tasks. Ultimately, the model's generalization capabilities are improved, giving it the ability to handle tasks with a wider range of input lengths.
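A hypothetical sketch of such a schedule is shown below; the convergence criterion, thresholds, and helper callables are assumptions for illustration rather than the authors' training recipe.

```python
def curriculum_schedule(train_one_epoch, evaluate, max_segments=32, patience=3, target_acc=0.95):
    # Hypothetical curriculum loop: train on tasks of n segments until validation
    # accuracy converges or reaches a threshold, then move on to longer tasks.
    # `train_one_epoch(n)` runs one epoch on n-segment data;
    # `evaluate(n)` returns validation accuracy on n-segment data.
    for n_segments in range(1, max_segments + 1):
        best, stale = 0.0, 0
        while stale < patience:
            train_one_epoch(n_segments)
            acc = evaluate(n_segments)
            if acc > best + 1e-4:
                best, stale = acc, 0
            else:
                stale += 1
            if best >= target_acc:
                break  # good enough on this length; grow the task
```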
Astonishing Generalization Capabilities
An essential question in this research is how well the RMT model generalizes to different sequence lengths. The authors found that RMT models trained on a larger number of segments generalized remarkably well to tasks twice as long as their training tasks. Moreover, the RMT model continues to extrapolate to far longer sequences, demonstrating the effectiveness of the learned memory operations even when they are applied thousands of times.
Unlocking the Secrets of Memory Operations
By analyzing attention patterns in the RMT model, the authors have managed to uncover several interesting insights into the memory operations involved in tackling their synthetic tasks. They found that specific patterns in the attention corresponded to memory-related operations, hinting at the underlying mechanisms that enable the RMT model to handle long-term dependencies in text. This finding paves the way for further research into developing even more efficient memory systems for the Transformer architecture.
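One rough way to probe this kind of behaviour, purely as an illustration and not the authors' analysis, is to measure how much attention regular tokens pay to the memory positions, assuming a Hugging Face-style encoder that can return attention weights.

```python
import torch

@torch.no_grad()
def memory_attention_share(encoder, inputs_embeds, num_mem=10):
    # Per layer, the average fraction of attention that regular tokens place on
    # the memory positions (illustrative probe; assumes an encoder supporting
    # `output_attentions=True` as in Hugging Face Transformers).
    out = encoder(inputs_embeds=inputs_embeds, output_attentions=True)
    shares = []
    for attn in out.attentions:                               # (batch, heads, seq, seq)
        to_memory = attn[:, :, num_mem:, :num_mem].sum(dim=-1)
        shares.append(to_memory.mean().item())
    return shares                                             # one value per layer
```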
Unleashing the Power of RMT
In summary, the Recurrent Memory Transformer is a groundbreaking innovation that scales the application of Transformers to a million tokens and beyond. This makes it ideally suited for handling long-term dependencies and context-rich text inputs, opening up new horizons within the field of natural language processing. The authors showcase the model's remarkable capabilities through a series of synthetic tasks, demonstrating its superior memorization and reasoning abilities compared to traditional Transformer models. As the authors aim to tailor the RMT approach to the most commonly used Transformers, it is clear that this research holds significant potential for enhancing long-term dependency handling in various NLP applications. With the advent of RMT, a whole new era of large-scale context processing for memory-intensive applications might just be around the corner. Stay tuned to see what the RMT has in store for the future of natural language processing!
DINOv2: Learning Robust Visual Features without Supervision
Authors: Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski
Source & References: https://arxiv.org/abs/2304.07193v1
Introduction
The recent breakthroughs in natural language processing (NLP) have had a massive impact on the field, giving rise to powerful foundation models that can be applied across a variety of tasks. Now, researchers are focusing on bringing that same level of success to computer vision, with the aim of producing all-purpose visual features that can perform well on any task without extensive fine-tuning. In this research paper, the authors explore self-supervised learning, a technique that learns from large quantities of curated visual data, to create robust visual features that surpass existing methods.
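As a concrete picture of the "frozen features" recipe this line of work targets, here is a minimal sketch of linear probing on top of a pretrained DINOv2 backbone; the torch.hub entrypoint and preprocessing assumptions come from the public DINOv2 release rather than this excerpt, and the downstream task is made up.

```python
import torch

# Minimal sketch: take embeddings from a frozen, pretrained DINOv2 backbone and
# train only a small linear head on top. Images are assumed to be 224x224 and
# ImageNet-normalized.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()

@torch.no_grad()
def embed(images):                      # images: (batch, 3, 224, 224)
    return backbone(images)             # (batch, embed_dim) global (CLS) features

num_classes = 10                        # illustrative downstream task
feat_dim = embed(torch.zeros(1, 3, 224, 224)).shape[-1]
linear_probe = torch.nn.Linear(feat_dim, num_classes)  # only this head gets trained
```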