Real World Simulators, Near-Infinite Context, Realistic Human Generation and more 🚀
Week 3, October 2023
Greetings,
Welcome to the 28th edition of the State of AI. This issue is packed with groundbreaking discoveries and innovations that push the boundaries of AI capabilities. From interactive real-world simulators to large language models that learn explicit rules, we're covering it all. Dive into Ring Attention, which enables near-infinite context lengths in transformers. Meet a zero-shot language agent that controls computers through structured reflection. Finally, explore HyperHuman, a state-of-the-art approach to generating hyper-realistic human images with latent structural diffusion.
In this edition, you will find a treasure trove of topics that reflect the rapid and ever-evolving advancements in the field of AI. It's an issue you won't want to miss!
Best regards,
Contents
Learning Interactive Real-World Simulators
Large Language Models can Learn Rules
Ring Attention with Blockwise Transformers for Near-Infinite Context
A Zero-Shot Language Agent for Computer Control with Structured Reflection
HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion
Learning Interactive Real-World Simulators: A Universal Approach to Generative Modeling
Authors: Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, Pieter Abbeel
Source & References: https://arxiv.org/abs/2310.06114
Introduction
The paper, "Learning Interactive Real-World Simulators," explores the possibility of learning a universal simulator (UniSim) of real-world interaction through generative modeling. The authors develop a video generation framework that combines a wealth of diverse data sources—including text, image, video, robotics, navigation, and human activities—to create a highly capable simulator that models the visual outcome of complex interactions within real-world environments. The paper presents a variety of use cases for the UniSim, demonstrating its effectiveness in training high-level vision-language planners, low-level reinforcement learning policies, and video captioning models.
The Challenge: Building a Comprehensive Real-World Simulator
One of the primary obstacles to building a universal real-world simulator is dataset availability. Although the internet hosts a vast amount of text, image, and video data, these disparate datasets must be meticulously assembled and fused to capture the full complexity of human experience. The authors tackle this challenge by extracting observations and actions from a wide array of datasets and converting them into a common format suitable for integration into the UniSim framework.
Orchestrating Datasets Rich in Different Axes
To create a realistic world simulator, the authors train their model on diverse datasets, each contributing a different aspect of the overall experience; these must be structured and fused to cover the range of human interactions with the world. Observation and action data are extracted from numerous sources, including simulated renderings, real robot data, human activity videos, panorama scans, and internet text-image pairs. These diverse data types are then converted into a standardized format, allowing UniSim to learn a single world model across all datasets.
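To make the idea of a common format concrete, here is a minimal sketch of what such a unification step might look like. The `Transition` container and both converter functions are illustrative assumptions, not the paper's actual data pipeline; the common thread is that every source is reduced to (video frames, text action) pairs.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Transition:
    """One unified training example: an observation clip plus a text action."""
    frames: List[np.ndarray]  # video frames, each an H x W x 3 array
    action_text: str          # the action expressed in natural language


def robot_episode_to_transitions(steps) -> List[Transition]:
    """Hypothetical converter: render low-level robot actions as text."""
    return [
        Transition(frames=s["images"],
                   action_text=f"move end effector by {s['delta']}")
        for s in steps
    ]


def captioned_video_to_transitions(frames, caption) -> List[Transition]:
    """Hypothetical converter: treat a video caption as the action label."""
    return [Transition(frames=frames, action_text=caption)]
```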
Enabling Long-Horizon Interactions through Rollouts in POMDP
To support long-horizon interactions, the authors draw inspiration from partially observable Markov decision processes (POMDPs). By establishing a connection between POMDPs and conditional video generation, UniSim supports consistent, long-range interactions across video generation boundaries: the model serves as the transition function, enabling dynamics modeling at any temporal control frequency. Temporally extended actions have been found to benefit various learning scenarios, such as hierarchical policies, skills, and options.
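This POMDP view can be summarized in a few lines: the learned model plays the role of the transition function, and each generated clip is appended to the history that conditions the next step. The `unisim.generate` and `policy` interfaces below are assumptions for illustration, not the paper's API.

```python
import numpy as np


def rollout(unisim, policy, first_frame: np.ndarray, horizon: int) -> list:
    """Treat the learned simulator as a POMDP transition function (sketch).

    At each step, the policy chooses a natural-language action and the
    simulator generates the next observation clip conditioned on the
    history, keeping interactions consistent across generation boundaries.
    """
    history = [first_frame]
    for _ in range(horizon):
        action_text = policy(history)                      # e.g. "open the top drawer"
        next_clip = unisim.generate(history, action_text)  # assumed interface
        history.extend(next_clip)                          # condition later steps on the past
    return history
```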
Simulating Real-World Interactions
The paper demonstrates UniSim's ability to model a variety of action-rich, long-horizon, and diverse interactions in real-world environments, showcasing its capabilities for training embodied planners, low-level control policies, and video captioning models.
Action-Rich Simulation
With natural language actions, UniSim generates complex human manipulations, such as kitchen tasks or switching on appliances, and integrates these actions into the simulated environment. This level of detail provides a more realistic and dynamic world model for training purposes.
Long-Horizon Simulation
UniSim supports the generation of temporally-extended observations, enabling the simulation of long-horizon interactions and sequential events. This capability helps maintain temporal consistency throughout the simulation process, ensuring that the model remains accurate and effective.
Diverse and Stochastic Simulations
Object interactions in UniSim are highly varied and detailed, and its outputs are stochastic: the same action can yield multiple plausible outcomes. This diversity makes for a more realistic simulation and supports the development of well-rounded machine intelligence models.
Bridging the Sim-to-Real Gap
One of the primary benefits of the proposed UniSim is its ability to generalize to the real world after training solely in a simulated environment. Policies developed within the simulator can transfer to real-world settings in a zero-shot manner, making significant strides toward closing the sim-to-real gap in embodied learning.
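As a rough illustration of how a policy might be trained entirely inside the learned simulator before zero-shot deployment, consider the loop below. Every interface here (`policy.act`, `policy.update`, `unisim.generate`, the learned `reward_model`) is an assumption for the sketch, not the paper's published API.

```python
def train_policy_in_simulator(unisim, policy, reward_model,
                              init_clip, episodes=1000, horizon=20):
    """Sketch of RL training against the learned simulator.

    Observations come from UniSim instead of a real robot, and a learned
    reward model scores the generated frames, so no real-world rollouts
    are needed during training.
    """
    for _ in range(episodes):
        history = [init_clip]
        for _ in range(horizon):
            action_text = policy.act(history)
            clip = unisim.generate(history, action_text)  # simulated outcome
            reward = reward_model(clip, action_text)      # e.g. a success detector
            policy.update(history, action_text, reward)   # any RL update rule
            history.append(clip)
    return policy  # afterwards deployed zero-shot on the real system
```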
Conclusion
The paper presents a groundbreaking step toward a universal real-world simulator built through generative modeling. By fusing diverse datasets across different axes and leveraging POMDP-style rollouts, UniSim can simulate complex, long-horizon interactions in a wide variety of environments. The applications for this technology are vast, with real-world transfer making it an ideal tool for training machine intelligence models that perform effectively outside the simulation.
Large Language Models Can Learn Rules
Authors: Zhaocheng Zhu, Yuan Xue, Xinyun Chen, Denny Zhou, Jian Tang, Dale Schuurmans, Hanjun Dai
Source & References: https://arxiv.org/abs/2310.07064
Introduction
Large Language Models (LLMs) like OpenAI's GPT-3 and GPT-4 have gained significant attention for their abilities across tasks such as program synthesis, arithmetic reasoning, symbolic reasoning, and commonsense reasoning. These models rely on advanced prompting techniques that let them tackle complex reasoning tasks by decomposing problems into smaller steps. However, they often suffer from hallucination, generating plausible-sounding but incorrect answers when the knowledge a task requires conflicts with the model's implicit knowledge.
To address this issue, the researchers propose a framework called Hypotheses-to-Theories (HtT), which enables LLMs to automatically induce rule libraries for reasoning tasks. The framework consists of two stages: induction and deduction. In the induction stage, the LLM generates candidate rules and verifies them against a set of training examples; in the deduction stage, it uses the learned rule library to answer new test questions.
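A minimal sketch of the two stages might look like the following. The prompt wording, the `llm(prompt)` call, the accuracy threshold, and both parsing helpers are assumptions for illustration rather than the paper's exact procedure.

```python
import re


def extract_rules(reply: str) -> list:
    """Hypothetical parser: collect lines the model labeled as rules."""
    return re.findall(r"rule:\s*(.+)", reply, flags=re.IGNORECASE)


def extract_answer(reply: str) -> str:
    """Hypothetical parser: take the final line of the reply as the answer."""
    return reply.strip().splitlines()[-1]


def induce_rules(llm, train_examples, min_accuracy=0.8) -> list:
    """Induction stage (sketch): have the LLM propose rules while solving
    training questions, then keep only rules that usually appear in
    correct solutions."""
    counts = {}  # rule -> (times_used, times_correct)
    for question, answer in train_examples:
        reply = llm(f"Solve step by step, labeling each rule you use:\n{question}")
        correct = extract_answer(reply) == answer
        for rule in extract_rules(reply):
            used, right = counts.get(rule, (0, 0))
            counts[rule] = (used + 1, right + int(correct))
    return [rule for rule, (used, right) in counts.items()
            if right / used >= min_accuracy]


def deduce(llm, rule_library, question: str) -> str:
    """Deduction stage (sketch): prepend the verified rule library to the
    prompt so the model retrieves rules instead of hallucinating them."""
    rules = "\n".join(f"- {r}" for r in rule_library)
    return llm(f"Use only these rules:\n{rules}\n\nQuestion: {question}")
```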