State of AI

Bi-Weekly AI Research Roundup

Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Sep 10, 2024

Contents

  1. VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

  2. Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

  3. Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices

  4. Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation

  5. Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

  6. The Future of Software Testing: AI-Powered Test Case Generation and Validation

  7. Espresso: Robust Concept Filtering in Text-to-Image Models

  8. A Comprehensive Evaluation of Histopathology Foundation Models for Ovarian Cancer Subtype Classification

  9. Improving Antibody Design with Force-Guided Sampling in Diffusion Models

  10. An Introduction to Quantum Reinforcement Learning (QRL)

  11. MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

  12. Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning

  13. Improving Pretraining Data Using Perplexity Correlations

  14. Promptable Closed-loop Traffic Simulation

  15. VFA: Vision Frequency Analysis of Foundation Models and Human

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu

Source and references: https://arxiv.org/abs/2409.04429v1


Introduction

VILA-U is a unified foundation model that integrates video, image, and language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components such as diffusion models.

Key Points

  • VILA-U uses a unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, enhancing visual perception.

  • Autoregressive image generation can achieve quality similar to that of diffusion models when trained on a high-quality dataset.

  • Using a fully token-based autoregressive framework, VILA-U performs comparably to more complex models.

Methodology

VILA-U's vision tower encoder processes visual inputs sequentially, generating a 1D token sequence. This sequence is then concatenated with text tokens to form a multi-modal sequence. The model is trained using a unified next-token prediction objective for both visual and text tokens.
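
The concatenate-then-predict idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not VILA-U's implementation: the vocabulary sizes, the choice to place visual ids after the text ids, and the helper names `build_multimodal_sequence` and `next_token_loss` are all hypothetical, and uniform logits stand in for a real model's output.

```python
import math

# Hypothetical vocabulary layout (an assumption for illustration only):
# text ids occupy [0, TEXT_VOCAB); visual codebook ids are shifted after them.
TEXT_VOCAB = 8
VISUAL_VOCAB = 4
VOCAB = TEXT_VOCAB + VISUAL_VOCAB


def build_multimodal_sequence(text_ids, visual_ids):
    """Concatenate text tokens and offset visual tokens into one 1D sequence."""
    return list(text_ids) + [TEXT_VOCAB + v for v in visual_ids]


def next_token_loss(logits, sequence):
    """Mean cross-entropy of predicting sequence[t + 1] from logits[t].

    The same loss is applied at every position, whether the target
    token is a text token or a visual token.
    """
    total = 0.0
    for t in range(len(sequence) - 1):
        row = logits[t]
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[sequence[t + 1]]
    return total / (len(sequence) - 1)


seq = build_multimodal_sequence([1, 5, 2], [0, 3])
# Uniform logits stand in for the output of a real model.
uniform_logits = [[0.0] * VOCAB for _ in seq]
loss = next_token_loss(uniform_logits, seq)
```

With uniform logits the loss equals the log of the combined vocabulary size, which is a handy sanity check: a trained model should score lower than this on both the text and the visual positions.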

Results and Findings

VILA-U significantly narrows the gap in visual understanding performance between end-to-end autoregressive models and continuous-token VLMs, while introducing competitive native visual generation capabilities. It achieves near state-of-the-art performance on visual language understanding and generation benchmarks.

Implications and Conclusions

VILA-U's unified framework simplifies the model architecture while maintaining high performance, demonstrating the feasibility of autoregressive methods for integrating visual and language modalities.


Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Authors: Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan

Source and references: https://arxiv.org/abs/2409.04410v1


Introduction

This technical report introduces Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The key focus of this work is to democratize the use of MAGVIT-v2's powerful visual tokenizer and explore its potential in plain auto-regressive models.
