Agents with Theory-of-Mind, RT-X Models, Photorealistic Text-to-Image Synthesis and more 🚀
Week 2, October 2023
Greetings,
Welcome to the 27th edition of the State of AI. In this illuminating issue, we navigate the intricate pathways of large language models and their journey towards possessing a Theory-of-Mind. We'll further unravel the mysteries of what language models truly comprehend and delve into the impressive domain of robotic learning with Open X-Embodiment. The artistic fusion of text and image comes to life through PixArt-α and the advanced Kandinsky technique, pushing the boundaries of photorealistic text-to-image synthesis.
This edition promises to captivate, inform, and showcase the latest breakthroughs in AI. Dive in and be inspired!
Best regards,
Contents
How FaR Are Large Language Models From Agents with Theory-of-Mind?
Language Models (Mostly) Know What They Know
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion
How FaR Are Large Language Models From Agents with Theory-of-Mind?
Authors: Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, Manaal Faruqui
Source & References: https://arxiv.org/abs/2310.03051v1
Introduction
Large Language Models (LLMs) have been making headlines for their prowess across a variety of natural language processing tasks, and they perform well on classic false-belief tests of Theory-of-Mind (ToM), the ability to understand other people's mental states from observations. Yet they often struggle to translate those inferences into a proper course of action in real-life scenarios that require social intelligence. This paper investigates that gap under a new evaluation paradigm called Thinking for Doing (T4D).
The Problem with Current Evaluation Paradigms
The authors identify a critical limitation of current LLM evaluation paradigms: they primarily probe models' proficiency at inferring mental states from scenarios while neglecting the essential human capability of acting on those inferences. To bridge this gap, they propose the Thinking for Doing (T4D) paradigm, which requires LLMs to connect inferences about others' mental states to actions in social scenarios.
Thinking for Doing (T4D) Task and Data
The authors formulate the T4D task to challenge LLMs' ability to connect social reasoning to proper actions in real-world scenarios. In the T4D framework, the model's task is not to state an inference but to decide on an action based on inferred mental states. Because the widely used ToM benchmark ToMi is templatic, the authors programmatically convert its stories from probing inferences to probing agents' action decisions; a rough illustration of this reframing appears in the sketch below. They also measure human agreement on T4D to validate the task's alignment with human ToM perspectives.
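As a hypothetical illustration only (the story, names, question wording, and helper functions below are made up for this newsletter, not taken from the paper's conversion templates), the shift from an inference probe to an action probe looks roughly like this:

```python
# Hypothetical sketch of turning a ToMi-style inference probe into a
# T4D-style action probe. The story and question templates are
# illustrative, not the paper's actual conversion code.

STORY = (
    "Sally puts her notebook in the drawer and leaves the room. "
    "While she is away, Anne moves the notebook to the backpack. "
    "Sally returns, wanting to use her notebook."
)

def tomi_inference_question() -> str:
    # Original ToMi-style probe: ask the model to infer a mental state.
    return f"{STORY}\nWhere will Sally look for the notebook?"

def t4d_action_question() -> str:
    # T4D-style probe: ask the model, as an observer, to choose an action
    # whose usefulness depends on the same (now implicit) inference.
    return (
        f"{STORY}\n"
        "You are watching the scene. Who, if anyone, should you inform "
        "about the notebook's current location?\n"
        "A) Sally\nB) Anne\nC) No one"
    )

if __name__ == "__main__":
    print(tomi_inference_question())
    print()
    print(t4d_action_question())
```

The key point is that answering the action question correctly still requires the same false-belief inference, but the inference is now left implicit.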
LLMs Struggle on T4D While Humans Find It Easy
The authors test various LLMs, including PaLM 2, GPT-3.5, and GPT-4, on T4D and find that these models perform significantly worse than humans, even though their performance on the original ToMi set is much closer to human scores. This discrepancy highlights the challenge T4D poses for even the strongest contemporary LLMs.
What Makes T4D Challenging for LLMs?
To uncover the reasons behind LLMs' struggles on T4D, the authors provide hints in the form of oracle reasoning steps that help models pinpoint the relevant inferences. When models receive these specific hints, their performance improves dramatically, approaching human levels. This suggests that the primary challenge LLMs face in T4D lies in identifying the implicit inference steps needed to choose the proper action; a hypothetical example of such a hint is sketched below.
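Continuing the made-up example above (the hint wording paraphrases the idea of an oracle reasoning step, not the paper's actual hint text, and build_prompt is a hypothetical helper):

```python
# Hypothetical sketch of appending an oracle reasoning hint to a T4D-style
# prompt. The hint wording is illustrative; the paper supplies its own
# oracle reasoning steps.
from typing import Optional

def build_prompt(question: str, oracle_hint: Optional[str] = None) -> str:
    parts = [question]
    if oracle_hint is not None:
        # The hint spells out the otherwise implicit ToM inference.
        parts.append(f"Hint: {oracle_hint}")
    parts.append("Answer with the letter of the best choice.")
    return "\n".join(parts)

question = (
    "Sally put her notebook in the drawer; Anne moved it to the backpack "
    "while Sally was away. Sally returns, wanting her notebook.\n"
    "Who, if anyone, should you inform about the notebook's location?\n"
    "A) Sally\nB) Anne\nC) No one"
)
hint = ("Sally was away when the notebook was moved, so she still "
        "believes it is in the drawer.")

print(build_prompt(question))        # without the oracle hint
print()
print(build_prompt(question, hint))  # with the oracle hint
```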
The Foresee and Reflect (FaR) Framework
The authors introduce a new zero-shot prompting framework, Foresee and Reflect (FaR), designed to help LLMs better structure their reasoning when tackling T4D tasks. FaR has two components: Foresee, which prompts the model to predict future events based on its observations, and Reflect, where the model reasons about which action choice best helps the characters with those potential challenges. Experiments show that FaR substantially improves LLMs' performance on T4D by providing a reasoning structure that encourages models to anticipate future challenges and reason about potential actions; a paraphrased sketch of this structure follows.
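A minimal sketch of what such a two-stage prompt might look like, assuming a paraphrase of the Foresee and Reflect instructions (the exact FaR prompt is given in the paper; the template wording and function name here are illustrative):

```python
# Hypothetical paraphrase of a Foresee-and-Reflect (FaR) style zero-shot
# prompt. The actual FaR instructions are specified in the paper; this
# sketch only illustrates the two-stage structure.

FAR_TEMPLATE = """{story}

{question}

First, Foresee: for each character, predict what is likely to happen next
and what challenges they may face, given what each of them knows.

Then, Reflect: for each candidate action, reason about whether it would
help the characters with those potential challenges.

Finally, state the single best action choice."""

def far_prompt(story: str, question: str) -> str:
    return FAR_TEMPLATE.format(story=story, question=question)

print(far_prompt(
    "Sally put her notebook in the drawer; Anne moved it to the backpack "
    "while Sally was away. Sally returns, wanting her notebook.",
    "Who, if anyone, should you inform about the notebook's location?\n"
    "A) Sally\nB) Anne\nC) No one",
))
```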
Generalization and Robustness of FaR
The authors conduct several studies to explore the strengths and limitations of the FaR framework. They investigate the importance of both foresight and reflection in improving LLMs and the sensitivity of LLMs to noisy future predictions. They also examine whether FaR overfits on the ToMi-converted T4D task or generalizes well to out-of-distribution story structures and scenarios that require ToM inferences to choose an action.
Conclusion and Future Directions
This paper presents a new evaluation paradigm, Thinking for Doing (T4D), which exposes the limitations of current LLM evaluations in assessing how well models connect social reasoning to action. The authors test various LLMs on T4D and find that all of them struggle, highlighting the gap between inferring mental states from observations and acting on those inferences.
The authors' introduction of the Foresee and Reflect (FaR) prompting framework is a promising step towards improving LLMs' abilities to perform T4D tasks. Through a series of experiments and analyses, they demonstrate the potential of the FaR framework to boost LLM performance in tasks that require ToM inferences. This research not only sheds light on the challenges faced by LLMs in tasks involving social intelligence but also sets the stage for future work that aims to develop next-generation AI agents capable of understanding and acting upon human mental states.
Language Models (Mostly) Know What They Know
Authors: Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, Scott Johnston, Sheer El-Showk, Andy Jones, Nelson Elhage, Tristan Hume, Anna Chen, Yuntao Bai, Sam Bowman, Stanislav Fort, Deep Ganguli, Danny Hernandez, Josh Jacobson, Jackson Kernion, Shauna Kravec, Liane Lovitt, Kamal Ndousse, Catherine Olsson, Sam Ringer, Dario Amodei, Tom Brown, Jack Clark, Nicholas Joseph, Ben Mann, Sam McCandlish, Chris Olah, Jared Kaplan
Source & References: https://arxiv.org/abs/2207.05221v4
Introduction
Recent work on language models increasingly emphasizes honesty: models should make faithful claims and accurately evaluate their own knowledge and reasoning. This paper studies the extent to which language models (LMs) possess this self-evaluation ability, how well calibrated their self-assessments are, and how well that ability generalizes.
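Calibration here means that a model's stated confidence should match its empirical accuracy. As a generic illustration (this is the standard expected calibration error metric over made-up data, not the paper's specific evaluation protocol):

```python
# Generic sketch of a calibration measurement (expected calibration error)
# over (confidence, was_correct) pairs. This is a standard metric, not the
# paper's exact evaluation protocol; the sample data below is made up.
from typing import List, Tuple

def expected_calibration_error(
    predictions: List[Tuple[float, bool]], n_bins: int = 10
) -> float:
    """Bin predictions by confidence; compare mean confidence to accuracy."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, correct in predictions:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    total = len(predictions)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# Made-up example: a well-calibrated model's 0.8-confidence answers
# should be right about 80% of the time.
sample = [(0.9, True), (0.8, True), (0.8, False), (0.6, True), (0.3, False)]
print(f"ECE = {expected_calibration_error(sample):.3f}")
```

A perfectly calibrated model would have an ECE of zero: its 80%-confidence answers would be correct about 80% of the time.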