Contents
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
DreamVTON: Customizing 3D Virtual Try-on with Personalized Diffusion Models
Imperfect Vision Encoders: Efficient and Robust Tuning for Vision-Language Models
Defending Our Privacy With Backdoors
Molecular Topological Profile (MOLTOP) -- Simple and Strong Baseline for Molecular Graph Classification
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Kernel Memory Networks: A Unifying Framework for Memory Modeling
A Quantum Leaky Integrate-and-Fire Spiking Neuron and Network
AutoGPT+P: Affordance-based Task Planning with Large Language Models
LLMmap: Fingerprinting For Large Language Models
AutoCodeRover: Autonomous Program Improvement
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference
Harmonic LLMs are Trustworthy
Recursive Introspection: Teaching Language Model Agents How to Self-Improve
Automated Ensemble Multimodal Machine Learning for Healthcare
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence
Authors: Canyu Zhao, Mingyu Liu, Wen Wang, Jianlong Yuan, Hao Chen, Bo Zhang, Chunhua Shen
Source and references: https://arxiv.org/abs/2407.16655v1
Introduction
This paper proposes a novel hierarchical framework called MovieDreamer that integrates the strengths of autoregressive models and diffusion-based rendering to generate coherent long-form video content with complex narrative structures and intricate plot progressions.
Key Points
MovieDreamer marries autoregressive modeling with diffusion-based rendering to achieve a synthesis of long-term coherence and short-term fidelity in visual storytelling.
The model leverages autoregressive models to ensure global consistency in key movie elements like character identities, props, and cinematic styles, even amidst complex scene transitions.
The model predicts visual tokens for specific local timespans, which are then decoded into keyframes and dynamically rendered into video sequences using diffusion rendering.
The method employs a multimodal script that hierarchically structures rich descriptions of scenes as well as the characters' identities, facilitating narrative continuity and enhancing character control (a hypothetical example of such a script entry follows this list).
The approach demonstrates superior generation quality with detailed visual continuity, high-fidelity visual details, and identity-preserving character rendering.
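To make the multimodal script idea concrete, below is one way a single script entry might be structured. The field names and values are purely illustrative assumptions, not the authors' actual schema.

```python
# Hypothetical shape of one multimodal script entry; every field name and
# value here is an illustrative assumption, not MovieDreamer's exact format.
script_entry = {
    "scene_description": "A dimly lit jazz bar; the detective enters from the rain.",
    "characters": [
        # Character identities are carried alongside the scene text so the
        # generator can keep faces and costumes consistent across scenes.
        {"name": "Detective Reyes", "identity_reference": "reyes_face_embedding.pt"},
    ],
    "visual_style": "1950s film noir, high-contrast black and white",
}
```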
Methodology
The approach utilizes a diffusion autoencoder to tokenize keyframes into compact visual tokens, which are then predicted in an autoregressive manner using a model initialized from a pretrained large language model (LLaMA). The autoregressive model is conditioned on a novel multimodal script that includes detailed character information and visual style descriptions. The predicted visual tokens are then decoded into images using a diffusion-based renderer, with an additional fine-tuning step to improve identity preservation.
Results and Findings
The authors present extensive experiments across various movie genres, demonstrating that their approach not only achieves superior visual and narrative quality but also extends the duration of generated content substantially beyond current capabilities. The method showcases detailed visual continuity, high-fidelity visual details, and identity-preserving character rendering.
Implications and Conclusions
The proposed MovieDreamer framework represents a significant advancement in the field of long-form video generation, enabling the automated production of movies that are both visually cohesive and contextually rich. By combining the strengths of autoregressive and diffusion-based models, the method paves the way for the creation of engaging, coherent, and visually stunning long-duration video content.
DreamVTON: Customizing 3D Virtual Try-on with Personalized Diffusion Models
Authors: Zhenyu Xie, Haoye Dong, Yufei Gao, Zehua Ma, Xiaodan Liang
Source and references: https://arxiv.org/abs/2407.16511v1
Introduction
This paper proposes DreamVTON, a new method for the task of 3D virtual try-on that can generate high-quality 3D humans given person images, clothes images, and a text prompt as inputs.