Contents
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices
Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
The Future of Software Testing: AI-Powered Test Case Generation and Validation
Espresso: Robust Concept Filtering in Text-to-Image Models
A Comprehensive Evaluation of Histopathology Foundation Models for Ovarian Cancer Subtype Classification
Improving Antibody Design with Force-Guided Sampling in Diffusion Models
An Introduction to Quantum Reinforcement Learning (QRL)
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning
Improving Pretraining Data Using Perplexity Correlations
Promptable Closed-loop Traffic Simulation
VFA: Vision Frequency Analysis of Foundation Models and Human
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
Source and references: https://arxiv.org/abs/2409.04429v1
Introduction
VILA-U is a Unified foundation model that integrates Video, Image, and Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models.
Key Points
VILA-U uses a unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, enhancing visual perception (see the sketch after this list).
Autoregressive image generation can match the quality of diffusion models when trained on a high-quality dataset.
VILA-U performs comparably to more complex models using a fully token-based autoregressive framework.
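To make the first point concrete, here is a minimal PyTorch sketch of how a vision tower can be trained so that its discrete tokens stay aligned with text. The toy encoder and decoder, the TextAlignedTokenizer and alignment_loss names, and the exact contrastive-plus-reconstruction loss mix are illustrative assumptions, not VILA-U's released implementation.

```python
# Minimal sketch of text-aligned discrete visual tokenization, in the spirit of
# VILA-U's unified vision tower. Shapes, module names, and the loss mix are
# illustrative assumptions rather than the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAlignedTokenizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))  # toy image encoder
        self.codebook = nn.Embedding(codebook_size, dim)                          # discrete visual codes
        self.decoder = nn.Linear(dim, 3 * 32 * 32)                                # toy image decoder

    def forward(self, images):
        z = self.encoder(images)                                   # (B, dim) continuous features
        # Nearest-codebook-entry quantization with a straight-through estimator.
        dists = torch.cdist(z, self.codebook.weight)               # (B, codebook_size)
        codes = dists.argmin(dim=-1)                               # discrete visual token ids
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()                               # straight-through gradient
        return z_q, codes, self.decoder(z_q)

def alignment_loss(z_q, text_emb, images, recon, temperature=0.07):
    """Contrastive text-image alignment plus reconstruction, both computed on the
    quantized features, so the discrete tokens carry text-aligned semantics
    while remaining decodable into pixels."""
    v = F.normalize(z_q, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                               # (B, B) similarity matrix
    labels = torch.arange(v.size(0))
    contrastive = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    reconstruction = F.mse_loss(recon, images.flatten(1))
    return contrastive + reconstruction

# Usage with random toy data:
tok = TextAlignedTokenizer()
images = torch.randn(4, 3, 32, 32)
text_emb = torch.randn(4, 256)                                     # e.g. from a frozen text encoder
z_q, codes, recon = tok(images)
loss = alignment_loss(z_q, text_emb, images, recon)
loss.backward()
```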
Methodology
VILA-U's vision tower encodes visual inputs into a 1D sequence of discrete tokens, which is concatenated with text tokens to form a single multi-modal sequence. The model is then trained with a unified next-token prediction objective over both visual and text tokens.
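The following minimal PyTorch sketch illustrates this unified next-token prediction setup: text tokens and discrete visual tokens share one vocabulary and one cross-entropy loss. The tiny transformer, the vocabulary sizes, and the UnifiedARModel name are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of unified next-token prediction over a concatenated
# multi-modal sequence, in the spirit of VILA-U. Vocabulary sizes, the tiny
# transformer, and the token layout are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, VISUAL_VOCAB = 32000, 8192
VOCAB = TEXT_VOCAB + VISUAL_VOCAB          # visual codes occupy ids >= TEXT_VOCAB

class UnifiedARModel(nn.Module):
    def __init__(self, dim=256, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, VOCAB)   # a single head predicts both token types

    def forward(self, tokens):
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)

# Build one multi-modal training sequence: text prompt tokens followed by the
# image's discrete visual tokens (shifted into the shared vocabulary range).
text_tokens = torch.randint(0, TEXT_VOCAB, (2, 16))
visual_tokens = torch.randint(0, VISUAL_VOCAB, (2, 64)) + TEXT_VOCAB
sequence = torch.cat([text_tokens, visual_tokens], dim=1)

model = UnifiedARModel()
logits = model(sequence[:, :-1])
# A single cross-entropy objective covers both text and visual positions.
loss = F.cross_entropy(logits.reshape(-1, VOCAB), sequence[:, 1:].reshape(-1))
loss.backward()
```

At generation time, the same model would be sampled autoregressively, with any visual token ids decoded back into pixels by the vision tower's decoder.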
Results and Findings
VILA-U significantly narrows the gap in visual understanding performance between end-to-end autoregressive models and continuous-token VLMs, while introducing competitive native visual generation capabilities. It achieves near state-of-the-art performance on visual language understanding and generation benchmarks.
Implications and Conclusions
VILA-U's unified framework simplifies the model architecture while maintaining high performance, demonstrating the feasibility of autoregressive methods for integrating visual and language modalities.
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
Authors: Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan
Source and references: https://arxiv.org/abs/2409.04410v1
Introduction
This technical report introduces Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The key focus of this work is to democratize the use of MAGVIT-v2's powerful visual tokenizer and explore its potential in plain auto-regressive models.