Contents
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
Hermes: Memory-Efficient Pipeline Inference for Large Models on Edge Devices
Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
The Future of Software Testing: AI-Powered Test Case Generation and Validation
Espresso: Robust Concept Filtering in Text-to-Image Models
A Comprehensive Evaluation of Histopathology Foundation Models for Ovarian Cancer Subtype Classification
Improving Antibody Design with Force-Guided Sampling in Diffusion Models
An Introduction to Quantum Reinforcement Learning (QRL)
MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct
Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning
Improving Pretraining Data Using Perplexity Correlations
Promptable Closed-loop Traffic Simulation
VFA: Vision Frequency Analysis of Foundation Models and Human
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Authors: Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
Source and references: https://arxiv.org/abs/2409.04429v1
Introduction
VILA-U is a Unified foundation model that integrates Video, Image, and Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components like diffusion models.
Key Points
VILA-U uses a unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, enhancing visual perception (see the sketch after this list).
Autoregressive image generation can match the quality of diffusion models when trained on a high-quality dataset.
VILA-U performs comparably to more complex models using a fully token-based autoregressive framework.
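To make the first point concrete, here is a minimal PyTorch sketch of how a vision tower can be trained so that its discrete tokens stay aligned with text. The toy encoder and decoder, the TextAlignedTokenizer and alignment_loss names, and the exact contrastive-plus-reconstruction loss mix are illustrative assumptions, not VILA-U's released implementation.

```python
# Minimal sketch of text-aligned discrete visual tokenization, in the spirit of
# VILA-U's unified vision tower. Shapes, module names, and the loss mix are
# illustrative assumptions rather than the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAlignedTokenizer(nn.Module):
    def __init__(self, codebook_size=8192, dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))  # toy image encoder
        self.codebook = nn.Embedding(codebook_size, dim)                          # discrete visual codes
        self.decoder = nn.Linear(dim, 3 * 32 * 32)                                # toy image decoder

    def forward(self, images):
        z = self.encoder(images)                                   # (B, dim) continuous features
        # Nearest-codebook-entry quantization with a straight-through estimator.
        dists = torch.cdist(z, self.codebook.weight)               # (B, codebook_size)
        codes = dists.argmin(dim=-1)                               # discrete visual token ids
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()                               # straight-through gradient
        return z_q, codes, self.decoder(z_q)

def alignment_loss(z_q, text_emb, images, recon, temperature=0.07):
    """Contrastive text-image alignment plus reconstruction, both computed on the
    quantized features, so the discrete tokens carry text-aligned semantics
    while remaining decodable into pixels."""
    v = F.normalize(z_q, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature                               # (B, B) similarity matrix
    labels = torch.arange(v.size(0))
    contrastive = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    reconstruction = F.mse_loss(recon, images.flatten(1))
    return contrastive + reconstruction

# Usage with random toy data:
tok = TextAlignedTokenizer()
images = torch.randn(4, 3, 32, 32)
text_emb = torch.randn(4, 256)                                     # e.g. from a frozen text encoder
z_q, codes, recon = tok(images)
loss = alignment_loss(z_q, text_emb, images, recon)
loss.backward()
```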
Methodology
VILA-U's vision tower encodes visual inputs into a 1D sequence of discrete tokens, which is concatenated with text tokens to form a single multi-modal sequence. The model is then trained with a unified next-token prediction objective over both visual and text tokens.
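The following minimal PyTorch sketch illustrates this unified next-token prediction setup: text tokens and discrete visual tokens share one vocabulary and one cross-entropy loss. The tiny transformer, the vocabulary sizes, and the UnifiedARModel name are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of unified next-token prediction over a concatenated
# multi-modal sequence, in the spirit of VILA-U. Vocabulary sizes, the tiny
# transformer, and the token layout are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB, VISUAL_VOCAB = 32000, 8192
VOCAB = TEXT_VOCAB + VISUAL_VOCAB          # visual codes occupy ids >= TEXT_VOCAB

class UnifiedARModel(nn.Module):
    def __init__(self, dim=256, layers=2, heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        block = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, VOCAB)   # a single head predicts both token types

    def forward(self, tokens):
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(self.embed(tokens), mask=causal)
        return self.head(h)

# Build one multi-modal training sequence: text prompt tokens followed by the
# image's discrete visual tokens (shifted into the shared vocabulary range).
text_tokens = torch.randint(0, TEXT_VOCAB, (2, 16))
visual_tokens = torch.randint(0, VISUAL_VOCAB, (2, 64)) + TEXT_VOCAB
sequence = torch.cat([text_tokens, visual_tokens], dim=1)

model = UnifiedARModel()
logits = model(sequence[:, :-1])
# A single cross-entropy objective covers both text and visual positions.
loss = F.cross_entropy(logits.reshape(-1, VOCAB), sequence[:, 1:].reshape(-1))
loss.backward()
```

At generation time, the same model would be sampled autoregressively, with any visual token ids decoded back into pixels by the vision tower's decoder.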
Results and Findings
VILA-U significantly narrows the gap in visual understanding performance between end-to-end autoregressive models and continuous-token VLMs, while introducing competitive native visual generation capabilities. It achieves near state-of-the-art performance on visual language understanding and generation benchmarks.
Implications and Conclusions
VILA-U's unified framework simplifies the model architecture while maintaining high performance, demonstrating the feasibility of autoregressive methods for integrating visual and language modalities.
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
Authors: Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan
Source and references: https://arxiv.org/abs/2409.04410v1
Introduction
This technical report introduces Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The key focus of this work is to democratize the use of MAGVIT-v2's powerful visual tokenizer and explore its potential in plain auto-regressive models.