Greetings,
Welcome to the 44th edition of the State of AI. This issue showcases a fascinating array of breakthroughs, from MoE-LLaVA's innovative approach to large vision-language models to StepCoder's advancement in code generation through reinforcement learning. We explore the expansive potential of Infini-gram in scaling n-gram language models to unprecedented levels and delve into Dolma, an open corpus of three trillion tokens that is redefining the landscape of language model pretraining. Finally, we introduce Weaver, a pioneering family of foundation models for creative writing.
Each article in this edition is a window into the cutting-edge developments and diverse applications of AI, offering you a comprehensive and engaging journey through the latest in AI innovation. Enjoy this dive into the future of artificial intelligence!
Best regards,
Contents
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Weaver: Foundation Models for Creative Writing
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Authors: Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, Li Yuan
Source and References: https://arxiv.org/abs/2401.15947
Introduction
Do you ever wonder how artificial intelligence can recognize and understand images and text simultaneously? The MoE-LLaVA paper by Bin Lin and his team explores the development of a large vision-language model (LVLM) that effectively processes both visual and textual data. This achievement is crucial in the rapidly evolving field of large language models (LLMs) like GPT-3, which have demonstrated remarkable instruction-following and generalization capabilities.
What's an LVLM?
An LVLM is a multi-modal machine learning model that understands and generates human-like responses based on both images and text. These models have shown great potential in tasks such as image captioning, object detection, and human-like chatting. However, simply increasing the size of LVLMs can be computationally expensive and inefficient.
The Concept of Mixture of Experts (MoE)
Mixture of Experts (MoE) is a technique used in machine learning that allows multiple smaller models, or "experts," to work together on a specific problem, instead of using one larger and more complex model. By leveraging a MoE approach, the authors were able to construct a more efficient LVLM, called MoE-LLaVA, that can maintain performance at a reduced computational cost.
MoE-LLaVA and MoE-tuning
What makes MoE-LLaVA unique is its three-stage training strategy called MoE-tuning, which helps prevent performance degradation usually associated with multi-modal learning and model sparsity.
Stage I: Converting LLM to LVLM
In this stage, the authors trained an additional "projection layer" in their model called an MLP (Multi-Layer Perceptron) to map visual tokens (such as image features) to the input domain of LLMs. This way, the model could understand both text and visual inputs.
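To make the idea of a projection layer concrete, here is a minimal sketch of an MLP that maps vision-encoder features into an LLM's token-embedding space. All dimensions and the two-layer GELU shape are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the real sizes depend on the vision encoder
# and the LLM being connected.
VISION_DIM, HIDDEN_DIM, LLM_DIM = 1024, 4096, 4096

# Two-layer MLP with a GELU activation, a common shape for a visual
# projection layer.
W1 = rng.standard_normal((VISION_DIM, HIDDEN_DIM)) * 0.02
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.standard_normal((HIDDEN_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project_visual_tokens(vision_features):
    """Map vision-encoder outputs into the LLM's input embedding space."""
    return gelu(vision_features @ W1 + b1) @ W2 + b2

# 196 patch tokens from a hypothetical vision encoder
visual_tokens = rng.standard_normal((196, VISION_DIM))
llm_ready = project_visual_tokens(visual_tokens)
print(llm_ready.shape)  # (196, 4096)
```

Once projected, the visual tokens can be concatenated with word embeddings and fed to the LLM like ordinary text tokens.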
Stage II: Pre-training with General Multi-modal Understanding
The authors then trained their model on multi-modal instruction data, empowering it with general multi-modal understanding and the adaptability to follow instructions involving both images and text.
Stage III: Sparsifying the Model and MoE Layers
In the final stage, the authors replicated the feed-forward neural networks (FFNs) from the earlier stages to initialize multiple experts and only trained the MoE layers in their model.
The resulting MoE-LLaVA model consists of a vision encoder, a visual projection layer (MLP), a word embedding layer, and multiple stacked MoE blocks. Each MoE layer is equipped with a "router" that computes, for each token, a probability distribution over the experts. The router then sends the token to its top-k experts for processing, while the remaining experts stay inactive.
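The routing idea can be sketched in a toy MoE layer. This is a simplified illustration, not the paper's implementation: each "expert" is just a single linear map, the sizes are made up, and real systems batch the expert computation rather than looping per token:

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL, N_EXPERTS, TOP_K = 64, 4, 2  # illustrative sizes

# Each "expert" stands in for a feed-forward network (here, one matrix).
experts = [rng.standard_normal((D_MODEL, D_MODEL)) * 0.02
           for _ in range(N_EXPERTS)]
router_w = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.02

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens):
    """Route each token to its top-k experts; the rest stay inactive."""
    probs = softmax(tokens @ router_w)        # (n_tokens, n_experts)
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top = np.argsort(probs[i])[-TOP_K:]   # indices of the top-k experts
        weights = probs[i, top] / probs[i, top].sum()  # renormalize over top-k
        for w, e in zip(weights, top):
            out[i] += w * (tok @ experts[e])  # only k experts run per token
    return out

tokens = rng.standard_normal((5, D_MODEL))
print(moe_layer(tokens).shape)  # (5, 64)
```

The key efficiency property is visible here: although the layer holds N_EXPERTS experts' worth of parameters, each token only activates TOP_K of them, so compute per token stays close to that of a much smaller dense model.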
Performance Highlights
Through their MoE-LLaVA model, the authors achieved remarkable performance on various visual understanding tasks, even surpassing other state-of-the-art models. Some key results include:
With just 3 billion sparsely activated parameters, MoE-LLaVA achieved performance on par with the LLaVA-1.5-7B model, which has more than twice as many parameters.
When compared to the LLaVA-1.5-13B, MoE-LLaVA achieved a 1.1% improvement on the POPE benchmark, a test designed to measure how often a model "hallucinates," i.e., falsely claims that objects are present in an image.
Implications and Future Research
MoE-LLaVA and its innovative MoE-tuning strategy offer valuable insights into how we can develop efficient and effective multi-modal learning systems. By demonstrating that it's possible to achieve high-performance LVLMs while maintaining a reduced computational cost, the authors have set an encouraging baseline for future research in the field.
In a world where understanding images and texts is becoming increasingly important, MoE-LLaVA holds great potential in various applications, from chatbots to image recognition systems. As we continue to push the limits of machine learning in multi-modal contexts, the lessons learned from MoE-LLaVA will undoubtedly inform the next generation of AI models and their applications.
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback
Authors: Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Junjie Shan, Caishuang Huang, Wei Shen, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Xuanjing Huang, Tao Gui
Source and references: https://arxiv.org/abs/2402.01391
Introduction
Automatically generating source code that aligns with given specifications, also known as code generation or program synthesis, has seen significant advancements thanks to large language models (LLMs). However, challenges remain in generating high-quality code that meets complex human requirements. One potential solution is to use reinforcement learning (RL) with compiler feedback. However, traditional RL methods struggle to explore the space of long code sequences, where reward signals are sparse, making optimization of LLMs less effective. To tackle these challenges, the researchers propose StepCoder, a novel RL framework for code generation.
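To give a flavor of what "compiler feedback" means as a reward signal, here is a minimal, hypothetical sketch. It is not StepCoder's actual reward design (which is richer, incorporating execution feedback on unit tests); it simply scores candidate programs by whether they compile:

```python
def compiler_reward(code: str) -> float:
    """Toy reward: 1.0 if the candidate Python code compiles, else 0.0.

    A real RL-from-compiler-feedback setup would use denser signals
    (unit tests, execution traces), but the feedback loop is the same:
    generate a candidate, check it, turn the result into a reward.
    """
    try:
        compile(code, "<candidate>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0

candidates = [
    "def add(a, b):\n    return a + b\n",   # valid
    "def add(a, b)\n    return a + b\n",    # missing colon
]
rewards = [compiler_reward(c) for c in candidates]
print(rewards)  # [1.0, 0.0]
```

In an RL loop, such rewards would be fed back to update the policy (the code-generating LLM), nudging it toward programs the compiler accepts.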