State of AI

Bi-Weekly AI Research Roundup
Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Nov 05, 2024

Contents

  1. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

  2. LongVILA: Scaling Long-Context Visual Language Models for Long Videos

  3. Adversarially Robust Decision Transformer

  4. DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs

  5. Can Large Language Model Agents Simulate Human Trust Behavior?

  6. Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI

  7. Disrupting Test Development with AI Assistants

  8. Adaptive Caching for Faster Video Generation with Diffusion Transformers

  9. Training-free Regional Prompting for Diffusion Transformers

  10. Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space

  11. Learning to Price Homogeneous Data

  12. Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

  13. Prompting with Phonemes: Enhancing LLM Multilinguality for non-Latin Script Languages

  14. EMMA: End-to-End Multimodal Model for Autonomous Driving

  15. DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution


MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Authors: Sirui Hong, Mingchen Zhuge, Jiaqi Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, Jürgen Schmidhuber

Source and references: https://arxiv.org/abs/2308.00352v7


Introduction

The paper introduces MetaGPT, an innovative meta-programming framework that incorporates efficient human workflows into LLM-based multi-agent collaborations to enhance problem-solving capabilities.

Key Points

  • MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, allowing agents with human-like domain expertise to verify intermediate results and reduce errors.

  • MetaGPT uses an assembly-line paradigm to assign specialized roles to different agents, efficiently breaking complex tasks into subtasks that multiple agents complete together.

  • On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems.

Methodology

MetaGPT models a group of agents as a simulated software company, with role specialization, workflow management, and structured communication mechanisms such as a shared message pool with role-based subscriptions. It also uses an executable feedback mechanism to improve code generation quality at runtime.
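
To make the shared message pool and role subscriptions concrete, here is a minimal sketch of that communication pattern. The class names, role names, and tags are illustrative assumptions, not MetaGPT's actual API.

```python
from dataclasses import dataclass, field

# Minimal sketch of role specialization over a shared message pool with
# subscriptions; names and structure are hypothetical, not MetaGPT's API.

@dataclass
class Message:
    sender: str   # role that produced the artifact
    content: str  # e.g. a PRD, design document, or code snippet
    tag: str      # topic that downstream roles can subscribe to

@dataclass
class MessagePool:
    messages: list = field(default_factory=list)
    subscriptions: dict = field(default_factory=dict)  # role -> set of tags

    def subscribe(self, role: str, tags: set) -> None:
        self.subscriptions[role] = tags

    def publish(self, msg: Message) -> None:
        self.messages.append(msg)

    def inbox(self, role: str) -> list:
        tags = self.subscriptions.get(role, set())
        return [m for m in self.messages if m.tag in tags]

# Assembly-line style handoff: each role consumes upstream artifacts and
# publishes its own, following an SOP-like ordering of responsibilities.
pool = MessagePool()
pool.subscribe("Architect", {"requirements"})
pool.subscribe("Engineer", {"design"})

pool.publish(Message("ProductManager", "PRD: build a CLI todo app", "requirements"))
prd = pool.inbox("Architect")[0]
pool.publish(Message("Architect", f"Design based on: {prd.content}", "design"))
print([m.content for m in pool.inbox("Engineer")])
```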

Results and Findings

MetaGPT achieves state-of-the-art performance on the HumanEval and MBPP benchmarks, outperforming previous LLM-based approaches. On the more challenging SoftwareDev dataset, MetaGPT significantly outperforms ChatDev in terms of executability, cost, code statistics, and human revision cost.

Implications and Conclusions

The successful integration of human-like SOPs in MetaGPT inspires future research on human-inspired techniques for artificial multi-agent systems and the regulation of LLM-based multi-agent frameworks.


LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Authors: Fuzhao Xue, Yukang Chen, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, Ethan He, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Linxi Fan, Yuke Zhu, Yao Lu, Song Han

Source and references: https://arxiv.org/abs/2408.10188v5


Introduction

This paper introduces LongVILA, a comprehensive solution for scaling long-context visual language models (VLMs) to process long videos. LongVILA addresses the challenges of long-context modeling through a multi-stage training pipeline and a novel distributed training system.

Key Points

  • LongVILA upgrades existing VLMs to support long video understanding by incorporating long context extension and long video supervised fine-tuning stages.

  • The paper presents the Multi-Modal Sequence Parallelism (MM-SP) system, an efficient and scalable framework for training and inference of long-context VLMs.

  • LongVILA-7B demonstrates strong performance on the VideoMME benchmark, achieving 61.8% accuracy with subtitles.

  • The LongVILA model, trained on 2048 frames, reaches 99.8% accuracy in a needle-in-a-haystack experiment spanning 6,000 frames (over 1 million tokens).

  • Increasing the number of video frames using LongVILA consistently improves performance on VideoMME and long video captioning tasks.

Methodology

The authors develop a five-stage training pipeline for LongVILA, including multi-modal alignment, large-scale pre-training, short supervised fine-tuning, context extension for language models, and long supervised fine-tuning. They also introduce the MM-SP system, which efficiently parallelizes long video training and inference.
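
As a rough picture of the sequence-parallel idea behind MM-SP, the toy sketch below splits one long multimodal token stream into contiguous per-device shards so that per-device memory stays roughly constant as context grows. The 196-tokens-per-frame rate and the sharding rule are assumptions for illustration, not the MM-SP implementation.

```python
# Toy sketch of sequence parallelism for a long multimodal token stream:
# the sequence is split into contiguous shards, one per device rank.
# This is an illustrative simplification, not the MM-SP system itself.

def shard_sequence(tokens: list, world_size: int) -> list:
    """Split a token sequence into near-equal contiguous shards, one per rank."""
    n = len(tokens)
    base, rem = divmod(n, world_size)
    shards, start = [], 0
    for rank in range(world_size):
        size = base + (1 if rank < rem else 0)
        shards.append(tokens[start:start + size])
        start += size
    return shards

# Example: a video of 6,000 frames at ~196 visual tokens per frame
# (hypothetical tokenizer rate) yields over 1M tokens, far more than one
# GPU can hold; 8-way sequence parallelism keeps ~147K tokens per rank.
tokens = list(range(6000 * 196))
shards = shard_sequence(tokens, world_size=8)
print([len(s) for s in shards])  # -> eight shards of 147,000 tokens each
```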

Results and Findings

LongVILA-7B achieves 61.8% accuracy on the VideoMME benchmark with subtitles. The LongVILA model trained on 2048 frames reaches 99.8% accuracy in a 6,000-frame needle-in-a-haystack experiment, with a context length of over 1 million tokens. Increasing the number of video frames consistently improves performance on VideoMME and long video captioning tasks. The MM-SP system efficiently scales the context length up to 2 million tokens without gradient checkpointing, achieving a 2.1x to 5.7x speedup over ring-style sequence parallelism and 1.1x to 1.4x over Megatron-LM with hybrid context and tensor parallelism.

Implications and Conclusions

The research presented in this paper provides a comprehensive solution for scaling long-context VLMs to process long videos, addressing both algorithmic and system-level challenges. The proposed training pipeline and MM-SP system enable VLMs to effectively leverage long-context information, with significant improvements in performance on long video understanding tasks.


Adversarially Robust Decision Transformer

Authors: Xiaohang Tang, Afonso Marques, Parameswaran Kamalaruban, Ilija Bogunovic

Source and references: https://arxiv.org/abs/2407.18414v2


Introduction

This paper introduces Adversarially Robust Decision Transformer (ARDT), a worst-case-aware Reinforcement Learning via Supervised Learning (RvS) algorithm designed to improve the adversarial robustness of the Decision Transformer (DT) method.
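
Based only on the description above, ARDT builds on return-conditioned supervised learning in the Decision Transformer style. The sketch below shows generic return-to-go conditioning plus a pessimistic relabeling of the conditioning return; the relabeling rule and function names are illustrative assumptions, not ARDT's actual algorithm.

```python
# Sketch of return-conditioned action selection in the Decision Transformer
# style, plus a worst-case-aware relabeling of returns-to-go. The pessimistic
# relabeling below is an illustrative assumption, not ARDT's exact method.

def worst_case_return_to_go(trajectory_rewards, adversary_penalties):
    """Relabel returns-to-go assuming the adversary realizes its worst penalty
    at every remaining step (a hypothetical, pessimistic rule)."""
    rtg, running = [], 0.0
    for r, p in zip(reversed(trajectory_rewards), reversed(adversary_penalties)):
        running += r - p          # subtract the adversary's worst-case effect
        rtg.append(running)
    return list(reversed(rtg))

def select_action(policy, state, target_return):
    """Condition the sequence model on (return-to-go, state) to pick an action."""
    return policy(state, target_return)

# Example with a trivial 'policy' that just echoes its conditioning inputs.
rewards = [1.0, 1.0, 1.0]
penalties = [0.5, 0.0, 0.5]        # hypothetical worst-case adversarial losses
rtgs = worst_case_return_to_go(rewards, penalties)
action = select_action(lambda s, g: ("act", s, round(g, 2)), state=0, target_return=rtgs[0])
print(rtgs, action)   # pessimistic returns-to-go and the conditioned action
```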
