Greetings,
Welcome to the 17th edition of the State of AI. In this issue, we explore how robots can translate vision and language directly into action, how dynamic LoRA composition enables efficient cross-task generalization, and how Large Language Models are being combined with compositional audio creation.
We also venture into the burgeoning field of injecting the 3D world into large language models, offering a whole new perspective on AI's capabilities. Finally, we examine the open problems and fundamental limitations of reinforcement learning from human feedback, a critical component in AI's evolution.
This issue promises a dynamic exploration of these cutting-edge advancements, offering a deeper understanding of the ever-expanding capabilities of AI. So sit back, and let us guide you through another intriguing journey in the world of artificial intelligence.
Best regards,
Contents
RT-2: New model translates vision and language into action
LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
WavJourney: Compositional Audio Creation with Large Language Models
3D-LLM: Injecting the 3D World into Large Language Models
Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
Authors: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, and others from Google DeepMind
Source & References: https://www.deepmind.com/blog/rt-2-new-model-translates-vision-and-language-into-action
Introduction
Google DeepMind researchers have introduced a new model, RT-2, that aims to incorporate the capabilities of large-scale pre-trained vision-language models into end-to-end robotic control tasks. This groundbreaking research explores how these models can enable robots to perform various tasks in real-world environments, with improved generalization skills and semantic reasoning abilities. The authors propose "vision-language-action" (VLA) models, which leverage the knowledge from internet-scale training and seamlessly integrate it with robotic control.
Vision-Language Models
Researchers have long sought to create powerful and versatile vision-language models (VLMs) capable of handling complex tasks across domains, including robotics. These models take one or more images as input and generate sequences of tokens that encode high-level tasks and the nuanced information needed to operate in real-world environments. One notable subgroup is models that accept visual and linguistic input and generate free-form natural language responses. For this study, the focus was on leveraging vision-language models pre-trained on internet-scale data and adapting them for direct, closed-loop robot control.
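To make the image-plus-text-in, text-out interface concrete, here is a minimal sketch. PaLI-X and PaLM-E are not publicly released, so the example uses BLIP-2 from Hugging Face Transformers purely as a stand-in for the same kind of vision-language interface; the image path and prompt are illustrative placeholders.

```python
# Minimal sketch of the image + text -> text interface described above.
# BLIP-2 stands in for PaLI-X / PaLM-E, which are not publicly available.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("tabletop_scene.jpg")  # placeholder camera observation
prompt = "Question: which object on the table could serve as a paperweight? Answer:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```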
Training Vision-Language Models with Robot-Awareness
The primary challenge in adapting vision-language models for robotics lies in teaching them to output robotic actions while preserving their existing knowledge of language and vision. To do so, the research team transformed the robot actions into text tokens and incorporated them into the training set as natural language tokens. This approach allows the model weights to be shared across language and action tasks and removes the need for action-only model layers.
Two previously proposed vision-language models, PaLI-X and PaLM-E, were adapted for closed-loop robotic manipulation tasks. Each continuous action dimension was discretized into 256 bins, and the resulting action tokens served as the ground truth for the fine-tuning process.
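A minimal sketch of what this co-fine-tuning setup might look like in practice is shown below. The example strings and the tokenizer are hypothetical, and the real PaLI-X / PaLM-E training pipelines are not public; the point is simply that robot actions become ordinary target text, so web data and robot data share one next-token-prediction loss.

```python
# Hypothetical sketch: robot actions written as plain-text targets so that
# web-scale vision-language data and robot trajectories share one
# next-token-prediction objective. Formats and values are illustrative.
training_examples = [
    # Internet-scale vision-language example (image input omitted for brevity)
    {"prompt": "Q: What is in the image? A:",
     "target": "a red apple on a wooden table"},
    # Robot trajectory example: the target is the discretized action,
    # written as eight integers (termination, position, rotation, gripper)
    {"prompt": "Instruction: pick up the apple. Action:",
     "target": "1 128 91 241 5 101 127 217"},
]

def to_token_ids(text, tokenizer):
    """Both kinds of targets pass through the same tokenizer,
    so no action-specific output head or layer is needed."""
    return tokenizer.encode(text)

# loss = cross_entropy(model(prompt_ids), target_ids)  # identical for both rows
```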
Robot Action Fine-Tuning
The next step was to implement robot-action fine-tuning, which consisted of tokenizing the actions into text tokens and creating "multimodal sentences" that respond to robotic instructions. Vision-language models were fine-tuned to perform instruction-following robotic policies by training them on robot trajectories and incorporating the tokenized actions directly into their output.
Based on the tokenization schemes used by the PaLI-X and PaLM-E models, specific tokens were reserved for action representation. These tokens were then used to represent each action as a sequence of eight integers, capturing the 6-DoF positional and rotational displacement of the robot end-effector, the gripper extension, and episode termination.
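The discretization itself is straightforward. Below is a hedged sketch of how an 8-dimensional continuous action could be mapped to 256 uniform bins per dimension and rendered as an action string; the bin ranges and the ordering of the dimensions are assumptions for illustration, not the paper's exact scheme.

```python
import numpy as np

# Illustrative sketch: discretize an 8-dim action into 256 bins per dimension
# and render it as the text-token string the VLM is trained to emit.
# The ranges and dimension ordering below are assumed for illustration only.
ACTION_LOW  = np.array([0.0, -0.1, -0.1, -0.1, -np.pi, -np.pi, -np.pi, 0.0])
ACTION_HIGH = np.array([1.0,  0.1,  0.1,  0.1,  np.pi,  np.pi,  np.pi, 1.0])

def action_to_tokens(action: np.ndarray) -> str:
    """Map [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper] to 8 ints in [0, 255]."""
    normalized = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    bins = np.clip((normalized * 255).round().astype(int), 0, 255)
    return " ".join(str(b) for b in bins)

def tokens_to_action(token_str: str) -> np.ndarray:
    """Invert the mapping at inference time to recover a continuous command."""
    bins = np.array([int(t) for t in token_str.split()], dtype=float)
    return ACTION_LOW + (bins / 255) * (ACTION_HIGH - ACTION_LOW)

print(action_to_tokens(np.array([1.0, 0.02, -0.03, 0.05, 0.1, 0.0, -0.2, 0.8])))
```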
Practical Considerations for Robot Tasks
Despite the potential for significant advances in robotic control, there remain practical challenges with introducing VLA models into real-time tasks. To address these issues, the researchers had to optimize the model size and inference speed to ensure that the robot could be controlled in real-time, without significant lag or computational bottlenecks.
Experimental Results and Findings
Over 6,000 robotic evaluation trials were conducted to assess the performance and capabilities of the RT-2 model. The model exhibited significant improvements in generalization across objects, scenes, and instructions. Additionally, it displayed a variety of emergent capabilities, including interpreting commands not present in robot training data, identifying the smallest or largest object in a scene, figuring out which object to use as an improvised hammer, and determining the best energy drink for someone who is tired.
Conclusion
Overall, the RT-2 model represents a major step forward in the integration of vision-language-action models for robotic control. By combining the knowledge and capabilities of internet-scale vision-language models with robotic control tasks, the approach offers significant improvements in generalization, semantic understanding, and reasoning. This research opens up new possibilities for the development of more advanced and adaptable robots that can perform various tasks in a wide range of real-world environments.
LoraHub: Boosting Generalization across Tasks with Efficient LoRA Composition
Authors: Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, Min Lin
Source & References: https://arxiv.org/abs/2307.13269
Introduction
In the rapidly evolving world of natural language processing, researchers have their eyes set on large language models (LLMs) such as OpenAI's GPT-3. As impressive as they are, their massive size often leads to issues with computational efficiency and memory usage during fine-tuning. To tackle these challenges, the authors of the paper "LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition" propose a method that uses Low-Rank Adaptation (LoRA) to adapt LLMs efficiently.
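To ground the idea before the detailed walkthrough: LoraHub composes several task-specific LoRA modules into one by weighting and summing their low-rank updates, with the weights chosen by a gradient-free search on a handful of examples from the new task. The sketch below shows only the composition step, with hypothetical module shapes and weights.

```python
import numpy as np

# Hypothetical sketch of LoRA composition: each upstream task i contributes a
# low-rank update B_i @ A_i to a frozen weight W0; a LoraHub-style merge
# combines them with scalar weights w_i, which in the paper are found by a
# gradient-free optimizer scored on a few examples of the unseen task.
d, k, r, n_modules = 768, 768, 8, 3
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))                                    # frozen base weight
A = [rng.normal(size=(r, k)) * 0.01 for _ in range(n_modules)]  # LoRA "down" matrices
B = [rng.normal(size=(d, r)) * 0.01 for _ in range(n_modules)]  # LoRA "up" matrices

def compose(weights):
    """Merge the LoRA modules into a single effective weight matrix."""
    delta = sum(w * (B_i @ A_i) for w, B_i, A_i in zip(weights, B, A))
    return W0 + delta

# In LoraHub, `weights` would be proposed by the gradient-free search and
# evaluated via the merged model's few-shot loss; here they are arbitrary.
W_merged = compose([0.6, 0.3, -0.2])
print(W_merged.shape)
```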