Greetings,
In this milestone 45th edition of the State of AI, we explore cutting-edge innovations reshaping artificial intelligence: generalist computer agents like OS-Copilot, evidence that simply scaling the number of LLM agents boosts performance, progress in mathematical reasoning with open language models, self-discovery techniques that let large language models compose their own reasoning structures, and grandmaster-level chess play achieved without search. Each topic illuminates the rapidly evolving landscape of AI research and development. Dive in and let your curiosity be sparked!
Best regards,
Contents
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
More Agents Is All You Need
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Self-Discover: Large Language Models Self-Compose Reasoning Structures
Grandmaster-Level Chess Without Search
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement
Authors: Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, Lingpeng Kong.
Source and references: https://arxiv.org/abs/2402.07456
Introduction
Imagine a digital assistant that can interact seamlessly with your computer, perform complex tasks for you, and self-improve while doing so. This idea is now closer to reality, thanks to a research paper by a team of researchers from Shanghai AI Laboratory, East China Normal University, Princeton University, and The University of Hong Kong. They've introduced OS-Copilot, a framework designed to build generalist computer agents capable of interacting with all elements of an operating system (OS), including the web, code terminals, files, multimedia, and various third-party applications.
Meet OS-Copilot and FRIDAY
OS-Copilot provides a unified interface for interacting with the computer ecosystem, consolidating common tools such as a Python interpreter, a bash terminal, mouse/keyboard control, and API calls.
Using the OS-Copilot framework, the researchers built FRIDAY (Fully Responsive Intelligence, Devoted to Assisting You), a self-improving embodied agent seamlessly integrated into the OS to automate general computer tasks. FRIDAY stands out from existing general-purpose agents like AutoGPT because it can learn to control unfamiliar applications through self-directed learning enabled by its self-evolving configurator.
To assess FRIDAY's problem-solving capabilities within the OS, the researchers evaluated its performance on GAIA, a benchmark for general AI assistants. FRIDAY achieved a success rate of 40.86% on the easiest level-1 tasks, outperforming previous methods by 35%. More impressively, FRIDAY managed to solve 6.12% of the challenging level-3 tasks, which no prior method had solved.
How Does It Work?
OS-Copilot works by dividing a user's request into subtasks and then using three components, the Planner, the Configurator, and the Actor, to execute them.
The Planner
The Planner is responsible for decomposing complex requests into simpler subtasks. It draws on knowledge of the agent's capabilities, such as its in-house tools and operating-system information, to generate plans at the right granularity.
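As a rough illustration of this decomposition step, the sketch below prompts an LLM with the agent's tools and OS details and asks for an ordered subtask list. The `call_llm` stub, prompt wording, and tool names are our own assumptions for illustration, not OS-Copilot's actual implementation.

```python
import json

def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned plan for this demo.
    return json.dumps([
        "Open the spreadsheet file",
        "Compute the column totals",
        "Save and close the file",
    ])

def plan(request: str, tools: list[str], os_info: str) -> list[str]:
    # Give the model the agent's capabilities so the plan lands at the
    # right granularity, then parse the JSON list of subtasks it returns.
    prompt = (
        f"You control a computer running {os_info}.\n"
        f"Available tools: {', '.join(tools)}.\n"
        f"Decompose this request into ordered subtasks (JSON list):\n{request}"
    )
    return json.loads(call_llm(prompt))

subtasks = plan("Total each column of budget.xlsx",
                ["python_interpreter", "bash", "gui_control"], "Ubuntu 22.04")
print(subtasks)
```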
The Configurator
The Configurator takes a subtask from the Planner and configures it to help the Actor complete the subtask. The Configurator consists of working memory, declarative memory (User Profile and Semantic Knowledge), and procedural memory (Tool Repository).
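The three memory stores and the prompt-building step might be organized as in the minimal sketch below. The field names, matching heuristic, and prompt format are assumptions made for illustration; the paper does not specify this interface.

```python
from dataclasses import dataclass, field

@dataclass
class Configurator:
    working_memory: list = field(default_factory=list)      # recent steps
    user_profile: dict = field(default_factory=dict)        # declarative memory
    semantic_knowledge: dict = field(default_factory=dict)  # declarative memory
    tool_repository: dict = field(default_factory=dict)     # procedural memory

    def configure(self, subtask: str) -> str:
        # Retrieve tools whose names overlap with the subtask wording,
        # then assemble a configuration prompt for the Actor.
        words = subtask.lower().split()
        tools = [name for name in self.tool_repository
                 if any(w in name for w in words)]
        return (f"Subtask: {subtask}\n"
                f"Relevant tools: {tools}\n"
                f"User profile: {self.user_profile}")

cfg = Configurator(
    user_profile={"os": "Ubuntu 22.04"},
    tool_repository={"spreadsheet_open": "...", "spreadsheet_sum": "..."},
)
prompt = cfg.configure("spreadsheet sum of column B")
print(prompt)
```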
The Actor
The Actor component has two stages: execution and self-criticism. The Executor uses the configuration prompt to complete the subtask by generating an executable command or function call with the correct parameters. The Critic module then assesses the outcome and offers feedback on the execution for self-correction and improvement.
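The execute-then-criticize loop can be sketched as below. The command generation and the critic here are toy stand-ins (our assumptions); in FRIDAY both would be driven by the LLM and the Critic's feedback would shape the next attempt.

```python
import subprocess

def execute(command: list[str]) -> str:
    # Run the generated command and capture its output.
    return subprocess.run(command, capture_output=True, text=True).stdout

def critique(task: str, output: str) -> tuple[bool, str]:
    # Toy critic: the subtask succeeds if the expected token appears.
    ok = "hello" in output
    return ok, "" if ok else "Expected the word 'hello' in the output."

def act(task: str, max_retries: int = 3) -> str:
    command = ["echo", "hello world"]  # stand-in for an LLM-generated command
    for _ in range(max_retries):
        output = execute(command)
        ok, feedback = critique(task, output)
        if ok:
            return output
        # A real Actor would feed `feedback` back into command generation here.
    raise RuntimeError("subtask failed after retries")

result = act("print a greeting")
print(result)
```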
Self-Directed Learning
FRIDAY can learn new skills autonomously through self-directed learning. Given a pre-defined learning objective, such as mastering spreadsheet manipulation, FRIDAY proposes a continuous stream of related tasks and, by solving them, accumulates tools and semantic knowledge.
Evaluation Results
FRIDAY showcased impressive performance on the GAIA benchmark, outperforming previous methods and demonstrating strong generalization across unseen applications. Notably, after self-directed learning it surpassed a state-of-the-art model specifically designed for spreadsheet control, improving from failing every task to a 60% success rate.
Future Directions
The OS-Copilot framework and empirical results provide the groundwork and insights for future research towards more capable and general-purpose computer agents. There are, however, plenty of avenues to explore, such as developing more advanced planners, more accurate user profiling, and personalized language agents. Furthermore, researchers could investigate methods for better parsing and evaluating execution results in the Critic module to improve self-correction and learning.
Wrapping Up
The OS-Copilot framework and its implementation in FRIDAY offer an exciting glimpse into the future of digital assistants that can interact with a broad range of applications and learn new skills on their own. With continued research and development, we can expect even more capable and adaptable AI-powered assistants. These agents have the potential to transform how we interact with computers, provide personalized assistance, and make our digital lives more efficient and enjoyable.
More Agents Is All You Need
Authors: Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye
Source and references: https://arxiv.org/abs/2402.05120
Introduction
Ever since the rise of large language models (LLMs), researchers have been exploring ways to push the limits of their capabilities when tackling complex tasks. In this eye-opening paper, Li and co-authors tackle a fundamental question: can simply increasing the number of agents instantiated by an LLM lead to better performance? Their answer: yes, it can!
They demonstrate that by combining multiple LLM agents and voting on the results, a more powerful solution can emerge, without relying on complicated techniques or frameworks. This method works across various tasks and LLMs, showing impressive results.
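The core sampling-and-voting idea can be sketched in a few lines: query the same model several times and return the majority answer. The `ask_model` stub below simulates a noisy agent that is right 60% of the time; real use would call an LLM API, and all names here are our own assumptions.

```python
import random
from collections import Counter

def ask_model(question: str, rng: random.Random) -> str:
    # Simulated agent: correct ("42") 60% of the time, otherwise a wrong guess.
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 41))

def majority_vote(question: str, n_agents: int, seed: int = 0) -> str:
    # Sample n_agents independent answers and keep the most common one.
    rng = random.Random(seed)
    answers = [ask_model(question, rng) for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]

answer = majority_vote("What is 6 * 7?", n_agents=101)
print(answer)
```

Even though each simulated agent is only modestly reliable, the majority across many agents is far more accurate, which mirrors the paper's finding that performance scales with ensemble size.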