State of AI

Bi-Weekly AI Research Roundup
Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Nov 05, 2024

Contents

  1. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

  2. LongVILA: Scaling Long-Context Visual Language Models for Long Videos

  3. Adversarially Robust Decision Transformer

  4. DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs

  5. Can Large Language Model Agents Simulate Human Trust Behavior?

  6. Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI

  7. Disrupting Test Development with AI Assistants

  8. Adaptive Caching for Faster Video Generation with Diffusion Transformers

  9. Training-free Regional Prompting for Diffusion Transformers

  10. Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space

  11. Learning to Price Homogeneous Data

  12. Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

  13. Prompting with Phonemes: Enhancing LLM Multilinguality for non-Latin Script Languages

  14. EMMA: End-to-End Multimodal Model for Autonomous Driving

  15. DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution


MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Authors: Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, Jürgen Schmidhuber

Source and references: https://arxiv.org/abs/2308.00352v7


Introduction

The paper introduces MetaGPT, an innovative meta-programming framework that incorporates efficient human workflows into LLM-based multi-agent collaborations to enhance problem-solving capabilities.

Key Points

  • MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, allowing agents with human-like domain expertise to verify intermediate results and reduce errors.

  • MetaGPT uses an assembly-line paradigm to assign specialized roles to different agents, efficiently breaking complex tasks into subtasks that multiple agents complete together.

  • On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems.

Methodology

MetaGPT models a group of agents as a simulated software company, with role specialization, workflow management, and structured communication mechanisms such as a shared message pool with role-based subscriptions. It also uses an executable feedback mechanism to improve code generation quality at runtime.
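
To make the shared message pool and role subscriptions concrete, here is a minimal sketch of that communication pattern. The class names, role names, and tags are illustrative assumptions, not MetaGPT's actual API.

```python
from dataclasses import dataclass, field

# Minimal sketch of role specialization over a shared message pool with
# subscriptions; names and structure are hypothetical, not MetaGPT's API.

@dataclass
class Message:
    sender: str   # role that produced the artifact
    content: str  # e.g. a PRD, design document, or code snippet
    tag: str      # topic that downstream roles can subscribe to

@dataclass
class MessagePool:
    messages: list = field(default_factory=list)
    subscriptions: dict = field(default_factory=dict)  # role -> set of tags

    def subscribe(self, role: str, tags: set) -> None:
        self.subscriptions[role] = tags

    def publish(self, msg: Message) -> None:
        self.messages.append(msg)

    def inbox(self, role: str) -> list:
        tags = self.subscriptions.get(role, set())
        return [m for m in self.messages if m.tag in tags]

# Assembly-line style handoff: each role consumes upstream artifacts and
# publishes its own, following an SOP-like ordering of responsibilities.
pool = MessagePool()
pool.subscribe("Architect", {"requirements"})
pool.subscribe("Engineer", {"design"})

pool.publish(Message("ProductManager", "PRD: build a CLI todo app", "requirements"))
prd = pool.inbox("Architect")[0]
pool.publish(Message("Architect", f"Design based on: {prd.content}", "design"))
print([m.content for m in pool.inbox("Engineer")])
```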

Results and Findings

MetaGPT achieves state-of-the-art performance on the HumanEval and MBPP benchmarks, outperforming previous LLM-based approaches. On the more challenging SoftwareDev dataset, MetaGPT significantly outperforms ChatDev in terms of executability, cost, code statistics, and human revision cost.

Implications and Conclusions

The successful integration of human-like SOPs in MetaGPT inspires future research on human-inspired techniques for artificial multi-agent systems and the regulation of LLM-based multi-agent frameworks.


LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Authors: Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han

Source and references: https://arxiv.org/abs/2408.10188v5


Introduction

This paper introduces LongVILA, a comprehensive solution for scaling long-context visual language models (VLMs) to process long videos. LongVILA addresses the challenges of long-context modeling through a multi-stage training pipeline and a novel distributed training system.

Key Points

  • LongVILA upgrades existing VLMs to support long video understanding by incorporating long context extension and long video supervised fine-tuning stages.

  • The paper presents the Multi-Modal Sequence Parallelism (MM-SP) system, an efficient and scalable framework for training and inference of long-context VLMs.

  • LongVILA-7B demonstrates strong performance on the VideoMME benchmark, achieving 61.8% accuracy with subtitles.

  • The LongVILA model, trained on 2048 frames, reaches 99.8% accuracy in a needle-in-a-haystack experiment spanning 6,000 frames (over 1 million tokens).

  • Increasing the number of video frames using LongVILA consistently improves performance on VideoMME and long video captioning tasks.

Methodology

The authors develop a five-stage training pipeline for LongVILA, including multi-modal alignment, large-scale pre-training, short supervised fine-tuning, context extension for language models, and long supervised fine-tuning. They also introduce the MM-SP system, which efficiently parallelizes long video training and inference.
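
As a rough picture of the sequence-parallel idea behind MM-SP, the toy sketch below splits one long multimodal token stream into contiguous per-device shards so that per-device memory stays roughly constant as context grows. The 196-tokens-per-frame rate and the sharding rule are assumptions for illustration, not the MM-SP implementation.

```python
# Toy sketch of sequence parallelism for a long multimodal token stream:
# the sequence is split into contiguous shards, one per device rank.
# This is an illustrative simplification, not the MM-SP system itself.

def shard_sequence(tokens: list, world_size: int) -> list:
    """Split a token sequence into near-equal contiguous shards, one per rank."""
    n = len(tokens)
    base, rem = divmod(n, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        shards.append(tokens[start:start + size])
        start += size
    return shards

# Example: a video of 6,000 frames at ~196 visual tokens per frame
# (hypothetical tokenizer rate) yields over 1M tokens, far more than one
# GPU can hold; 8-way sequence parallelism keeps ~147K tokens per rank.
tokens = list(range(6000 * 196))
shards = shard_sequence(tokens, world_size=8)
print([len(s) for s in shards])  # -> eight shards of 147,000 tokens each
```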

Results and Findings

LongVILA-7B achieves 61.8% accuracy on the VideoMME benchmark with subtitles. The LongVILA model trained on 2048 frames reaches 99.8% accuracy in a 6,000-frame needle-in-a-haystack experiment, with a context length of over 1 million tokens. Increasing the number of video frames consistently improves performance on VideoMME and long video captioning tasks. The MM-SP system efficiently scales the context length up to 2 million tokens without gradient checkpointing, achieving a 2.1x to 5.7x speedup over ring-style sequence parallelism and 1.1x to 1.4x over Megatron-LM with hybrid context and tensor parallelism.

Implications and Conclusions

The research presented in this paper provides a comprehensive solution for scaling long-context VLMs to process long videos, addressing both algorithmic and system-level challenges. The proposed training pipeline and MM-SP system enable VLMs to effectively leverage long-context information, with significant improvements in performance on long video understanding tasks.


Adversarially Robust Decision Transformer

Authors: Xiaohang Tang, Afonso Marques, Parameswaran Kamalaruban, Ilija Bogunovic

Source and references: https://arxiv.org/abs/2407.18414v2


Introduction

This paper introduces Adversarially Robust Decision Transformer (ARDT), a worst-case-aware Reinforcement Learning via Supervised Learning (RvS) algorithm designed to improve the adversarial robustness of the Decision Transformer (DT) method.
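
Based only on the description above, ARDT builds on return-conditioned supervised learning in the Decision Transformer style. The sketch below shows generic return-to-go conditioning plus a pessimistic relabeling of the conditioning return; the relabeling rule and function names are illustrative assumptions, not ARDT's actual algorithm.

```python
# Sketch of return-conditioned action selection in the Decision Transformer
# style, plus a worst-case-aware relabeling of returns-to-go. The pessimistic
# relabeling below is an illustrative assumption, not ARDT's exact method.

def worst_case_return_to_go(trajectory_rewards, adversary_penalties):
    """Relabel returns-to-go assuming the adversary realizes its worst penalty
    at every remaining step (a hypothetical, pessimistic rule)."""
    rtg, running = [], 0.0
    for r, p in zip(reversed(trajectory_rewards), reversed(adversary_penalties)):
        running += r - p          # subtract the adversary's worst-case effect
        rtg.append(running)
    return list(reversed(rtg))

def select_action(policy, state, target_return):
    """Condition the sequence model on (return-to-go, state) to pick an action."""
    return policy(state, target_return)

# Example with a trivial 'policy' that just echoes its conditioning inputs.
rewards = [1.0, 1.0, 1.0]
penalties = [0.5, 0.0, 0.5]        # hypothetical worst-case adversarial losses
rtgs = worst_case_return_to_go(rewards, penalties)
action = select_action(lambda s, g: ("act", s, round(g, 2)), state=0, target_return=rtgs[0])
print(rtgs, action)   # pessimistic returns-to-go and the conditioned action
```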
