Contents
1. Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
2. JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
3. SimPO: Simple Preference Optimization with a Reference-Free Reward
4. On Speeding Up Language Model Evaluation
5. Improving Alignment and Robustness with Circuit Breakers
6. ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation
7. LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
8. Multi-Object Hallucination in Vision-Language Models
9. Tailor3D: Customized 3D Assets Editing and Generation with Dual-Side Images
10. CrowdMoGen: Zero-Shot Text-Driven Collective Motion Generation
11. Stepping on the Edge: Curvature Aware Learning Rate Tuners
12. The Tug-of-War Between Deepfake Generation and Detection
13. Potential Based Diffusion Motion Planning
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Authors: Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
Source and references: https://arxiv.org/abs/2407.06189v1
Introduction
This paper introduces Video Self-Training with Augmented Reasoning (Video-STaR), a novel approach to enhance Large Vision-Language Models (LVLMs) by enabling the use of diverse labeled video datasets for visual instruction tuning.
Key Points
Video-STaR is the first video self-training approach for LVLMs, allowing the utilization of any labeled video dataset.
Video-STaR cycles between instruction generation and finetuning, improving general video understanding and adapting LVLMs to novel downstream tasks.
Video-STaR generates question-answer pairs by prompting the LVLM, filtering the generated answers to keep only those that contain the original video labels, and retraining the model on the resulting dataset.
Video-STaR utilizes the video labels as weak supervision for video instruction tuning.
Video-STaR demonstrates improved performance on video question-answering benchmarks and diverse downstream video tasks.
Methodology
Video-STaR cycles through three main steps: (1) Answer Generation, where the LVLM is prompted to generate answers for a given video and question; (2) Label Verification, where the generated answers are filtered to keep only those that contain the correct video labels; and (3) Instruction Tuning, where the LVLM is retrained on the filtered dataset.
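A minimal sketch of one such cycle is given below in Python pseudocode; the `lvlm` interface, `finetune` helper, and record fields are hypothetical placeholders for exposition, not the authors' implementation.

```python
# Hedged sketch of the Video-STaR self-training cycle.
# The `lvlm` object, `finetune` callable, and record fields are illustrative
# placeholders rather than the paper's actual code.

def video_star_cycle(lvlm, labeled_videos, finetune, num_cycles=3):
    for _ in range(num_cycles):
        instruction_data = []

        for video, question, label in labeled_videos:
            # (1) Answer Generation: prompt the current LVLM.
            answer = lvlm.generate(video=video, prompt=question)

            # (2) Label Verification: keep only answers containing the
            # original video label, which acts as weak supervision.
            if label.lower() in answer.lower():
                instruction_data.append(
                    {"video": video, "question": question, "answer": answer}
                )

        # (3) Instruction Tuning: retrain the LVLM on the verified pairs.
        lvlm = finetune(lvlm, instruction_data)

    return lvlm
```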
Results and Findings
Video-STaR improved TempCompass video question-answering performance by 10% compared to the baseline Video-LLaVA model.
On downstream tasks, Video-STaR enhanced Kinetics700-QA accuracy by 20% and action quality assessment on FineDiving by 15%.
Video-STaR enabled the creation of VSTaR-1M, a 1M-example video instruction-tuning dataset sourced from diverse datasets and tasks.
Implications and Conclusions
The promising results of Video-STaR demonstrate the effectiveness of self-training approaches in enhancing LVLM capabilities for complex video content. This work opens new research avenues in expanding LVLM knowledge bases using readily available labeled image and video datasets and in further advancing long-form video understanding.
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
Authors: Yu Zeng, Vishal M. Patel, Haochen Wang, Xun Huang, Ting-Chun Wang, Ming-Yu Liu, Yogesh Balaji
Source and references: https://arxiv.org/abs/2407.06187v1
Introduction
The paper presents JeDi, a novel finetuning-free approach for personalized text-to-image generation that can operate on any number of reference images and preserve the appearance of custom subjects while generating diverse variations.
Key Points
JeDi is a finetuning-free text-to-image generation method with a novel joint-image diffusion model.
The authors present a simple and scalable data synthesis pipeline for generating a multi-image personalization dataset with images sharing the same subject.
The paper introduces novel architecture and sampling techniques, such as coupled self-attention and image guidance, to achieve high-fidelity personalization.
Methodology
The authors construct a synthetic dataset called Synthetic Same-Subject (S3) by prompting pre-trained text-to-image diffusion models to generate photo collages of the same subject and augmenting the backgrounds. They then train a joint-image diffusion model that learns the joint distribution of multiple related text-image pairs sharing a common subject, using coupled self-attention layers to capture relationships between the images.
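The coupled self-attention idea can be made concrete: tokens from every image in a same-subject set attend to one another, so subject appearance propagates across the joint set rather than staying confined to a single image. Below is a minimal PyTorch-style illustration; the shapes, class name, and layer layout are assumptions for exposition, not the paper's architecture code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoupledSelfAttention(nn.Module):
    """Illustrative coupled self-attention for a joint-image diffusion model.

    Shapes and naming are assumptions for exposition, not the authors'
    implementation. Tokens of every image in a same-subject set attend to
    the tokens of all images in that set.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        # x: (batch, n_images, tokens, dim) features of n_images related images.
        b, n, t, d = x.shape
        h = self.num_heads

        # Fold the image axis into the token axis so attention runs jointly
        # over all images in the set instead of within each image separately.
        qkv = self.qkv(x).reshape(b, n * t, 3, h, d // h).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        out = F.scaled_dot_product_attention(q, k, v)      # (b, h, n*t, d/h)
        out = out.transpose(1, 2).reshape(b, n, t, d)
        return self.proj(out)
```

For example, `CoupledSelfAttention(512)(torch.randn(2, 4, 256, 512))` would jointly attend over four related images of 256 tokens each.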
Results and Findings
Experimental results show that the proposed JeDi model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming prior finetuning-based and finetuning-free personalization baselines. JeDi can faithfully preserve the identity represented by any number of reference images, even for challenging uncommon objects.
Implications and Conclusions
The authors present an effective technique for learning a finetuning-free personalization model that can generate personalized images based on any number of reference images without requiring expensive optimization or additional modules.
SimPO: Simple Preference Optimization with a Reference-Free Reward
Authors: Yu Meng, Mengzhou Xia, Danqi Chen
Source and references: https://arxiv.org/abs/2405.14734v2
Introduction
This paper proposes SimPO, a simple yet effective offline preference optimization algorithm for aligning large language models with human values and intentions.
Key Points
SimPO uses the average log probability of a sequence as the implicit reward, which better aligns with model generation and eliminates the need for a reference model.
SimPO introduces a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses; the resulting objective is written out after this list.
SimPO consistently and significantly outperforms existing approaches, including Direct Preference Optimization (DPO) and its latest variants, across various state-of-the-art training setups and extensive instruction-following benchmarks.
SimPO does not significantly increase response length compared to the supervised fine-tuned (SFT) or DPO models, indicating minimal length exploitation.
The top-performing SimPO model, built on Llama3-8B-Instruct, achieves remarkable performance on the AlpacaEval 2 and Arena-Hard benchmarks, surpassing the publicly available Claude 3 Opus model.
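Written out in the paper's notation, the reference-free reward is the length-normalized log-likelihood scaled by β, and the loss is a Bradley-Terry objective with a target reward margin γ between the winning response y_w and the losing response y_l:

```latex
% SimPO's reference-free implicit reward: length-normalized log-likelihood.
r_{\mathrm{SimPO}}(x, y) \;=\; \frac{\beta}{|y|}\,\log \pi_\theta(y \mid x)

% Bradley-Terry loss with target reward margin \gamma.
\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) \;=\;
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log \sigma\!\left(
      \frac{\beta}{|y_w|}\log \pi_\theta(y_w \mid x)
      \;-\; \frac{\beta}{|y_l|}\log \pi_\theta(y_l \mid x)
      \;-\; \gamma
    \right)
  \right]
```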