Greetings,
Welcome to the 51st edition of the State of AI. In this issue, we explore the frontiers of AI with a focus on generalist video generation through the Mora multi-agent framework, efficient fine-tuning of 100+ language models with LlamaFactory, and parameter-efficient reinforcement learning from human feedback with PERL. We also delve into the fascinating world of LLM Agent Operating Systems and high-fidelity human image personalization with FlashFace.
Each of these topics showcases the rapid advancements and innovative applications of AI, promising a captivating and informative read. We hope you find this edition insightful and thought-provoking.
Best regards,
Contents
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
PERL: Parameter Efficient Reinforcement Learning from Human Feedback
LLM Agent Operating System
FlashFace: Human Image Personalization with High-fidelity Identity Preservation
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
Authors: Zhengqing Yuan, Ruoxi Chen, Zhaoxu Li, Haolong Jia, Lifang He, Chi Wang, Lichao Sun
Source and references: https://arxiv.org/abs/2403.13248
Introduction
Video generation has been a fascinating research area in machine learning, but popular models like OpenAI's Sora remain closed-source, which hinders the academic community from replicating or extending their success. In response to these challenges, researchers have designed Mora, a new multi-agent framework that leverages several advanced visual AI agents to replicate the generalist video generation capabilities demonstrated by Sora.
Breaking Down Video Generation
Mora addresses the gap in video generation by decomposing the task into smaller subtasks, each assigned to a dedicated agent: prompt selection and generation, text-to-image generation, image-to-image generation, image-to-video generation, and video-to-video generation. These agents work collaboratively, automatically looping and permuting through the subtasks, so a single flexible pipeline can complete a wide range of video generation tasks.
Mora's multi-agent collaboration offers a unique approach to video generation that preserves the visual diversity, style, and quality inherent in text-to-image models, even facilitating editing capabilities. The framework's modular nature allows it to meet diverse user needs while maintaining competitive performance in tasks like text-to-video generation, text-conditional image-to-video generation, extending generated videos, video-to-video editing, connecting videos, and simulating digital worlds.
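To make the pipeline idea concrete, here is a minimal sketch of how different video tasks could be expressed as different orderings of the same agents. The task names, agent names, and interfaces below are hypothetical illustrations, not Mora's actual API:

```python
# Hypothetical sketch of Mora-style task decomposition: each task is just a
# different ordering of the same reusable agents. Names are illustrative.

# Each task maps to a sequence of agent names; each agent is modeled as a
# function that takes the previous stage's output and returns its own.
PIPELINES = {
    "text-to-video": ["prompt_enhance", "text_to_image", "image_to_video"],
    "image-to-video": ["image_to_video"],
    "video-to-video-editing": ["video_to_video"],
    "connect-videos": ["video_connection"],
}

def run_task(task: str, agents: dict, initial_input):
    """Route the input through the agent sequence registered for this task."""
    output = initial_input
    for agent_name in PIPELINES[task]:
        output = agents[agent_name](output)
    return output

# Example usage (with agents = {"prompt_enhance": enhance_fn, ...}):
#     video = run_task("text-to-video", agents, "A cat surfing a wave")
```

Because the registry is just data, adding a new capability amounts to registering a new agent sequence rather than retraining a monolithic model.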
Agent Roles in Mora
Mora's agents are designed to handle specific tasks in the video generation process. They work together to ensure that the textual descriptions are thoroughly prepared and translated efficiently into visual representations. There are five key agents in Mora:
Prompt Selection and Generation Agent: This agent refines the user-provided text prompts so they are well prepared for the video generation process.
Text-to-Image Generation Agent: This agent translates the enhanced textual descriptions into high-quality initial images.
Image-to-Image Generation Agent: This agent modifies a given source image based on specific textual instructions, handling the image editing task.
Image-to-Video Generation Agent: This agent extends the initial image into a vibrant video sequence while maintaining temporal stability and visual consistency.
Video Connection Agent: This agent seamlessly connects two input videos based on user instructions, enabling smooth video transitions.
By coordinating the efforts of these agents, Mora proficiently handles a wide range of video tasks while offering superior editing flexibility and visual fidelity.
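As an illustration of what a single agent might look like internally, the sketch below wraps an off-the-shelf diffusion model behind a simple callable interface. The class name and the checkpoint choice are assumptions for illustration; the paper selects its own backbone models for each agent:

```python
# Illustrative sketch only: the agent interface and checkpoint choice are
# assumptions, not Mora's published implementation.
import torch
from diffusers import StableDiffusionPipeline

class TextToImageAgent:
    """Translates an (already enhanced) text prompt into an initial image."""

    def __init__(self, checkpoint: str = "runwayml/stable-diffusion-v1-5"):
        self.pipe = StableDiffusionPipeline.from_pretrained(
            checkpoint, torch_dtype=torch.float16
        ).to("cuda")

    def __call__(self, prompt: str):
        # Return the first generated image; downstream agents animate it.
        return self.pipe(prompt).images[0]
```

A Mora-style framework would register one such wrapper per subtask and let the orchestration layer chain their inputs and outputs.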
Experimental Results and Performance
Mora's performance was assessed using basic metrics from the publicly available video generation benchmark VBench, along with self-defined metrics for its six task capabilities. The results show that Mora outperforms existing open-source models on text-to-video generation, ranking second only to Sora.
Additionally, Mora demonstrated competitive results in the remaining five tasks, highlighting the versatility and general capabilities of the framework. This impressive achievement underlines the effectiveness of Mora, showcasing its potential as a groundbreaking tool in video generation.
Key Takeaways
Mora is an innovative framework, specifically designed to enable multi-agent collaboration in video generation tasks. It can handle various video-related tasks by integrating and coordinating text-to-image, image-to-image, image-to-video, and video-to-video agents through a flexible and adaptable system. This collaborative approach allows Mora to produce high-quality video content that rivals the performance of established models like Sora.
By introducing Mora, researchers have unlocked new frontiers in video generation, pushing the boundaries of creativity and expression in digital media. Its exceptional performance across multiple tasks is promising, and its potential use cases are vast, from filmmaking and robotics to healthcare. While the closed-source nature of Sora limits academic advancement, Mora paves the way for an open, collaborative future in video generation capabilities.
LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models
Authors: Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Yongqiang Ma
Source and references: https://arxiv.org/abs/2403.13372
Large language models (LLMs) have become revolutionary in the world of artificial intelligence, powering applications such as question answering, machine translation, and information extraction. With thousands of models available through open-source communities like Hugging Face, fine-tuning has become essential for adapting them to downstream tasks. The challenge, however, lies in fine-tuning LLMs efficiently with limited resources.