Contents
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
DreamVTON: Customizing 3D Virtual Try-on with Personalized Diffusion Models
Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models
Defending Our Privacy With Backdoors
Molecular Topological Profile (MOLTOP) -- Simple and Strong Baseline for Molecular Graph Classification
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Kernel Memory Networks: A Unifying Framework for Memory Modeling
A Quantum Leaky Integrate-and-Fire Spiking Neuron and Network
AutoGPT+P: Affordance-based Task Planning with Large Language Models
LLMmap: Fingerprinting For Large Language Models
AutoCodeRover: Autonomous Program Improvement
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
Harmonic LLMs are Trustworthy
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
Automated Ensemble Multimodal Machine Learning for Healthcare
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
Authors: Canyu Zhao, Mingyu Liu, Wen Wang, Jianlong Yuan, Hao Chen, Bo Zhang, Chunhua Shen
Source and references: https://arxiv.org/abs/2407.16655v1
Introduction
This paper proposes a novel hierarchical framework called MovieDreamer that integrates the strengths of autoregressive models and diffusion-based rendering to generate coherent long-form video content with complex narrative structures and intricate plot progressions.
Key Points
MovieDreamer marries autoregressive modeling with diffusion-based rendering to achieve a synthesis of long-term coherence and short-term fidelity in visual storytelling.
The model leverages autoregressive models to ensure global consistency in key movie elements like character identities, props, and cinematic styles, even amidst complex scene transitions.
The model predicts visual tokens for specific local timespans, which are then decoded into keyframes and dynamically rendered into video sequences using diffusion rendering.
The method employs a multimodal script that hierarchically structures rich descriptions of scenes as well as the characters' identities, facilitating narrative continuity and enhancing character control (a hypothetical example of such a script entry follows this list).
The approach demonstrates superior generation quality with detailed visual continuity, high-fidelity visual details, and identity-preserving character rendering.
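To make the multimodal script idea concrete, below is one way a single script entry might be structured. The field names and values are purely illustrative assumptions, not the authors' actual schema.

```python
# Hypothetical shape of one multimodal script entry; every field name and
# value here is an illustrative assumption, not MovieDreamer's exact format.
script_entry = {
    "scene_description": "A dimly lit jazz bar; the detective enters from the rain.",
    "characters": [
        # Character identities are carried alongside the scene text so the
        # generator can keep faces and costumes consistent across scenes.
        {"name": "Detective Reyes", "identity_reference": "reyes_face_embedding.pt"},
    ],
    "visual_style": "1950s film noir, high-contrast black and white",
}
```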
Methodology
The approach utilizes a diffusion autoencoder to tokenize keyframes into compact visual tokens, which are then predicted in an autoregressive manner using a model initialized from a pretrained large language model (LLaMA). The autoregressive model is conditioned on a novel multimodal script that includes detailed character information and visual style descriptions. The predicted visual tokens are then decoded into images using a diffusion-based renderer, with an additional fine-tuning step to improve identity preservation.
Results and Findings
The authors present extensive experiments across various movie genres, demonstrating that their approach not only achieves superior visual and narrative quality but also extends the duration of generated content substantially beyond current capabilities. The method showcases detailed visual continuity, high-fidelity visual details, and identity-preserving character rendering.
Implications and Conclusions
The proposed MovieDreamer framework represents a significant advancement in the field of long-form video generation, enabling the automated production of movies that are both visually cohesive and contextually rich. By combining the strengths of autoregressive and diffusion-based models, the method paves the way for the creation of engaging, coherent, and visually stunning long-duration video content.
DreamVTON: Customizing 3D Virtual Try-on with Personalized Diffusion Models
Authors: Zhenyu Xie, Haoye Dong, Yufei Gao, Zehua Ma, Xiaodan Liang
Source and references: https://arxiv.org/abs/2407.16511v1
Introduction
This paper proposes DreamVTON, a new method for the task of 3D virtual try-on that can generate high-quality 3D humans given person images, clothes images, and a text prompt as inputs.