Contents
Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Mamba or Transformer for Time Series Forecasting? Mixture of Universals (MoU) Is All You Need
Flextron: Many-in-One Flexible Large Language Model
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Examination of Code generated by Large Language Models
Low-Cost Language Models: Survey and Performance Evaluation on Python Code Generation
CogVLM2: Visual Language Models for Image and Video Understanding
GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models
Entropic Distribution Matching in Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Atari-GPT: Investigating the Capabilities of Multimodal Large Language Models as Low-Level Policies for Atari Games
Authors: Nicholas R. Waytowich, Devin White, MD Sunbeam, Vinicius G. Goecks
Source and references: https://arxiv.org/abs/2408.15950v1
Introduction
This paper explores the potential of multimodal large language models (LLMs) as low-level controllers for Atari video games, introducing "Atari-GPT" as a new benchmark to evaluate their capabilities.
Key Points
Investigates whether multimodal LLMs can function effectively as low-level policies for Atari games
Assesses the models' visual understanding, spatial reasoning, and strategic intuition in the Atari gaming environment
Compares the performance of frontier LLMs, including GPT-4V Turbo, GPT-4o, and Gemini 1.5 Flash, to traditional reinforcement learning agents, random agents, and human players
Examines the impact of incorporating human-demonstration examples through In-Context Learning (ICL) on the LLMs' game-playing abilities
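For context, ICL here means prepending human demonstration frames and the actions the human chose to the model's prompt, before the frame the model must act on. Below is a minimal sketch assuming an OpenAI-style chat message format; the helper name `build_icl_messages` and the exact payload layout are illustrative, not the paper's actual prompt construction:

```python
def build_icl_messages(demo_frames_b64, demo_actions, current_frame_b64,
                       system_prompt):
    """Assemble a few-shot chat prompt: each human demonstration is a
    (frame, action) pair shown before the frame the model must act on.
    Assumes an OpenAI-style message format; the paper's prompt may differ."""
    def image_msg(b64):
        return {"role": "user",
                "content": [{"type": "image_url",
                             "image_url": {"url": f"data:image/png;base64,{b64}"}}]}

    messages = [{"role": "system", "content": system_prompt}]
    for frame_b64, action in zip(demo_frames_b64, demo_actions):
        messages.append(image_msg(frame_b64))              # demonstration frame
        messages.append({"role": "assistant",
                         "content": str(action)})          # human's action
    messages.append(image_msg(current_frame_b64))          # frame to act on
    return messages
```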
Methodology
The researchers conducted two main experiments. The Game-Play experiment assessed the LLMs' performance in playing Atari games, with and without ICL. The Understanding and Reasoning experiment evaluated the models' abilities in visual understanding, spatial reasoning, strategic intuition, and environment identification using a set of prompts and a rubric-based evaluation.
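To make the Game-Play setup concrete, here is a minimal sketch of the kind of frame-in, action-out control loop it implies, assuming gymnasium with the ALE Atari environments installed. The `query_model` helper, the Breakout action mapping, and the system prompt are placeholders, not the paper's actual prompting pipeline:

```python
import base64
import io

import ale_py                 # provides the ALE/... Atari environments
import gymnasium as gym
from PIL import Image

gym.register_envs(ale_py)     # make the ALE namespace visible to gym.make


def frame_to_base64(frame) -> str:
    """Encode an RGB game frame as a base64 PNG for a multimodal API call."""
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")


def query_model(image_b64: str, system_prompt: str) -> int:
    """Hypothetical wrapper around a multimodal LLM API (GPT-4o, Gemini, ...);
    it must return one of the environment's discrete action ids. The real
    request/response handling depends on the provider's SDK."""
    raise NotImplementedError


SYSTEM_PROMPT = (
    "You are playing Atari Breakout. Given the current frame, reply with a "
    "single action id: 0=NOOP, 1=FIRE, 2=RIGHT, 3=LEFT."
)

env = gym.make("ALE/Breakout-v5")   # default obs_type="rgb" yields image frames
obs, info = env.reset(seed=0)
done = False
while not done:
    action = query_model(frame_to_base64(obs), SYSTEM_PROMPT)
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```

The cumulative reward collected by a loop like this is what gets compared against the human, reinforcement learning, and random-agent baselines.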
Results and Findings
On average, the LLMs achieved 10-25% of human-level performance in the Atari game-playing experiments, with GPT-4o demonstrating the highest overall performance.
The inclusion of ICL had little to no impact on the models' game-playing abilities, suggesting that they struggled to effectively learn and generalize from the provided human demonstrations.
In the Understanding and Reasoning experiment, GPT-4o exhibited the strongest performance across visual understanding, strategy formulation, and environment identification tasks, but struggled with spatial reasoning.
Implications and Conclusions
While the LLMs did not match the performance of human players or reinforcement learning agents, their ability to engage with Atari games at all highlights the potential for these models to function as low-level controllers in dynamic and visually complex environments. The findings suggest that further research and refinement of LLMs are needed to overcome their current limitations in spatial reasoning and few-shot learning to fully leverage their multimodal capabilities for low-level control tasks.
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Authors: Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, Siming Fu, Haoyuan Li, Bolin Li, Zhelun Yu, Si Liu, Hongsheng Li, Hao Jiang
Source and references: https://arxiv.org/abs/2408.15881v1
Introduction
This paper introduces LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (s-MLLM) by distilling knowledge from a large-scale MLLM (l-MLLM).
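The post does not reach the method details, but the title names MoE knowledge distillation. As background, a generic response-level distillation objective (KL divergence between the teacher's and student's softened token distributions) looks like the sketch below; this is a standard formulation with an assumed temperature of 2.0, not LLaVA-MoD's actual loss:

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Generic knowledge distillation: KL divergence between the teacher's
    and student's temperature-softened next-token distributions."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean matches the mathematical definition of KL divergence;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```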