AppAgent: Multimodal Agents as Smartphone Users
Authors: Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, Gang Yu
Source and references: https://arxiv.org/abs/2312.13771
Introduction
As artificial intelligence advances, there has been growing interest in using large language models (LLMs) to complete complex tasks, moving beyond simple language processing into the realm of cognitive abilities. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. The framework allows the agent to interact with different apps as a human user would, taking actions through a simplified action space that mimics human-like interactions.
The Innovative Framework and Environment
The authors' main contribution is an open-source multimodal agent framework that operates smartphone apps through a purpose-built action space, without requiring access to the system back end. The experimental environment is built on a command-line interface (the Android Debug Bridge) for communicating with the device, and the framework is tested on the Android operating system.
The agent's action space consists of common human interactions with smartphones, such as tapping and swiping. The researchers designed four basic functions: Tap, Long_press, Swipe, and Text. These predefined actions simplify the agent's interactions, in particular by eliminating the need for precise screen coordinates, which language models struggle to predict accurately.
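As a rough sketch, such an action space could be implemented on top of the adb shell's `input` command, consistent with the command-line setup described above; the exact signatures, the fixed swipe duration, and the space escaping are illustrative assumptions rather than the framework's actual API.

```python
import subprocess

def adb_shell(command: str) -> None:
    """Run a shell command on the connected Android device via adb."""
    subprocess.run(["adb", "shell", *command.split()], check=True)

def tap(x: int, y: int) -> None:
    """Tap the screen at the given pixel coordinates."""
    adb_shell(f"input tap {x} {y}")

def long_press(x: int, y: int, duration_ms: int = 1000) -> None:
    """Long-press by issuing a swipe that starts and ends at the same point."""
    adb_shell(f"input swipe {x} {y} {x} {y} {duration_ms}")

def swipe(x: int, y: int, direction: str, dist: int = 300) -> None:
    """Swipe from (x, y) in one of four directions by `dist` pixels."""
    dx, dy = {"up": (0, -dist), "down": (0, dist),
              "left": (-dist, 0), "right": (dist, 0)}[direction]
    adb_shell(f"input swipe {x} {y} {x + dx} {y + dy} 400")

def text(message: str) -> None:
    """Type text into the focused input field ('%s' stands in for spaces)."""
    adb_shell(f"input text {message.replace(' ', '%s')}")
```

Note that in the framework itself the agent does not supply raw coordinates: it refers to UI elements by the numbered tags overlaid on the screenshot, and those tags are resolved to element positions before a low-level call like the ones above is dispatched.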
Exploration Phase: Learning to Use Apps
The core of the agent's learning process is the exploration phase, during which it learns the functionalities of various smartphone apps. The agent can learn in one of two ways: through autonomous interaction or by observing human demonstrations.
Autonomous Interactions
In autonomous interactions, the agent operates the app on its own and observes how the interface changes in response to the actions it takes. It draws on the prior knowledge embedded in large language models to steer exploration toward the elements essential for operating the app, and it records the observed effect of each interaction in a document for later use. Exploration stops when the agent judges the assigned task to be complete.
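As a rough illustration, an exploration loop of this kind might be structured as in the sketch below. The `Decision` dataclass and the four injected callables (`capture_screenshot`, `query_llm`, `execute_action`, `describe_effect`) are hypothetical stand-ins for the framework's internals, not its actual API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    """One model decision during exploration (hypothetical structure)."""
    action: str      # "tap", "long_press", "swipe", "text", or "FINISH"
    element: int     # numbered tag of the target UI element
    rationale: str   # the model's explanation for choosing this action

def autonomous_exploration(task: str,
                           capture_screenshot: Callable,
                           query_llm: Callable,
                           execute_action: Callable,
                           describe_effect: Callable,
                           max_rounds: int = 20) -> dict[int, str]:
    """Let the agent act on the app and document what each UI element does."""
    element_docs: dict[int, str] = {}
    for _ in range(max_rounds):
        before = capture_screenshot()            # screenshot annotated with numbered elements
        decision = query_llm(task=task, screenshot=before, documents=element_docs)
        if decision.action == "FINISH":          # the agent judges the task complete
            break
        execute_action(decision)                 # dispatch tap / long_press / swipe / text
        after = capture_screenshot()
        # Record the observed effect so the knowledge can be reused at deployment time
        element_docs[decision.element] = describe_effect(before, after, decision)
    return element_docs
```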
Observing Human Demonstrations
A second, more effective exploration method has the agent observe human demonstrations. A human user operates the app while the agent records the elements used and the actions taken. This narrows the exploration space and keeps the agent away from irrelevant app pages, making it a more streamlined and efficient approach than autonomous interaction.
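Under the same hypothetical interface, a demonstration-based recorder could look like the sketch below; `wait_for_user_action` is an assumed helper that blocks until the human performs an action (returning None when the demonstration ends), so the agent documents only the elements the human actually touches.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class DemoStep:
    """One recorded step of a human demonstration (hypothetical structure)."""
    element: int     # numbered tag of the element the user interacted with
    action: str      # "tap", "long_press", "swipe", or "text"
    effect: str      # generated description of the resulting UI change

def record_demonstration(capture_screenshot: Callable,
                         wait_for_user_action: Callable,
                         describe_effect: Callable) -> list[DemoStep]:
    """Watch a human operate the app and document only the elements they use."""
    trace: list[DemoStep] = []
    while True:
        before = capture_screenshot()
        user_action = wait_for_user_action()     # blocks until the user acts, None when done
        if user_action is None:
            break
        after = capture_screenshot()
        trace.append(DemoStep(
            element=user_action.element,
            action=user_action.action,
            effect=describe_effect(before, after, user_action),
        ))
    return trace
```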
Deployment Phase: Executing Complex Tasks
After the exploration phase, the agent uses the knowledge it has acquired to carry out complex tasks. Given a task, it proceeds step by step, working from screenshots of the app's interface and a dynamically generated document that details the functions of UI elements and the effects of actions on the current UI page.
At each step, the agent first states its observations of the app interface, then its thoughts about the task in light of those observations. It then executes an action chosen from the available functions and summarizes the interaction history. The deployment phase ends when the agent considers the task accomplished.
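A single deployment step can be pictured with the sketch below, which mirrors the observation / thought / action / summary structure just described. The `Step` fields and the injected callables reuse the same hypothetical interface as the exploration sketch and are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    """One deployment-phase response from the model (hypothetical structure)."""
    observation: str   # what the model sees on the current screen
    thought: str       # its reasoning about the task and the observation
    action: str        # "tap", "long_press", "swipe", "text", or "FINISH"
    element: int       # numbered tag of the target element
    summary: str       # updated summary of the interaction history

def run_task(task: str,
             element_docs: dict[int, str],
             capture_screenshot: Callable,
             query_llm: Callable,
             execute_action: Callable,
             max_steps: int = 30) -> str:
    """Carry out a task step by step using the documents built during exploration."""
    history_summary = ""
    for _ in range(max_steps):
        screenshot = capture_screenshot()               # current UI with numbered elements
        step: Step = query_llm(task=task,
                               screenshot=screenshot,
                               documents=element_docs,  # per-element docs from exploration
                               history=history_summary)
        if step.action == "FINISH":                     # the agent deems the task accomplished
            break
        execute_action(step)
        history_summary = step.summary                  # rolling summary replaces raw history
    return history_summary
```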
Evaluating the Multimodal Agent
To evaluate the framework, the authors conducted extensive experiments across 50 tasks in 10 different apps. These tasks spanned a wide range of applications, such as social media, messaging, email, maps, shopping, and complex image editing apps. Both quantitative results and user studies underscored the advantages of this design, including its adaptability, user-friendliness, and efficient learning and operating capabilities.
The experimental results support the framework's potential, demonstrating the versatility and effectiveness of this agent in the realm of smartphone app operation.
Final Thoughts
The paper's authors developed a unique and versatile multimodal agent framework capable of operating smartphone apps akin to human users. By using an exploration phase to learn about app functionalities and a deployment phase to execute complex tasks, the framework has proven to be adaptable, user-friendly, and efficient across various applications.
This research marks an important contribution to AI-assisted smartphone app operation and highlights the potential for future development in this area as language models grow more capable and cognitive agents take on complex tasks beyond language processing alone.