Contents
DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis
Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots
EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation
DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Assumption-Lean and Data-Adaptive Post-Prediction Inference
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Causal Language Modeling Can Elicit Search and Reasoning Capabilities on Logic Puzzles
HiFi-CS: Towards Open Vocabulary Visual Grounding For Robotic Grasping Using Vision-Language Models
SARO: Space-Aware Robot System for Terrain Crossing via Vision-Language Model
Towards Leveraging Contrastively Pretrained Neural Audio Embeddings for Recommender Tasks
Gaussian is All You Need: A Unified Framework for Solving Inverse Problems via Diffusion Posterior Sampling
SGFormer: Single-Layer Graph Transformers with Approximation-Free Linear Complexity
Affective Computing Has Changed: The Foundation Model Disruption
AnyBipe: An End-to-End Framework for Training and Deploying Bipedal Robots Guided by Large Language Models
DreamHead: Learning Spatial-Temporal Correspondence via Hierarchical Diffusion for Audio-driven Talking Head Synthesis
Authors: Fa-Ting Hong, Yunfei Liu, Yu Li, Changyin Zhou, Fei Yu, Dan Xu
Source and references: https://arxiv.org/abs/2409.10281v1
Introduction
The paper proposes a hierarchical diffusion framework, termed DreamHead, that learns spatial-temporal correspondences in audio-driven talking head synthesis without compromising the model's intrinsic quality and adaptability.
Key Points
DreamHead learns dense facial landmarks as an intermediate representation to model the spatial-temporal correspondence for talking head synthesis.
DreamHead is constructed as a hierarchy of two diffusion structures: (i) an audio-to-landmark diffusion that generates accurate, temporally smooth, jitter-free facial landmarks from the input audio, and (ii) a landmark-to-image diffusion that synthesizes high-fidelity portrait videos by learning spatial correspondences between landmarks and facial expressions.
Extensive experiments on two datasets show that DreamHead can estimate jitter-free landmark sequences and produce spatially and temporally consistent portrait videos, outperforming state-of-the-art counterparts.
Methodology
The proposed hierarchical diffusion framework, DreamHead, first learns an audio-to-landmark diffusion to estimate a facial landmark sequence corresponding to the given temporal audio sequence. Then, a landmark-to-image diffusion is further designed to generate photo-realistic portrait videos with the predicted facial landmarks.
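The two-stage cascade above can be sketched in a few lines. This is a toy illustration with hypothetical shapes and a stand-in denoiser, not the paper's actual diffusion networks: `denoise`, `audio_to_landmarks`, and `landmarks_to_frames` are placeholder names, and the real models are trained diffusion backbones running a full reverse process.

```python
import numpy as np

def denoise(x, cond, steps=4):
    """Toy iterative denoiser: pull noisy x toward a conditioning signal.

    Stands in for a trained diffusion model's reverse process.
    """
    for _ in range(steps):
        x = x + 0.5 * (cond - x)  # each step moves halfway toward cond
    return x

def audio_to_landmarks(audio_feats, n_landmarks=68, rng=None):
    """Stage 1: predict a landmark sequence (T, n_landmarks, 2) from audio."""
    rng = rng or np.random.default_rng(0)
    T = audio_feats.shape[0]
    # Hypothetical conditioning: broadcast the audio feature to landmark shape.
    cond = np.tile(audio_feats[:, :1, None], (1, n_landmarks, 2))
    noise = rng.standard_normal((T, n_landmarks, 2))
    return denoise(noise, cond)

def landmarks_to_frames(landmarks, hw=(16, 16), rng=None):
    """Stage 2: render frames (T, H, W) conditioned on predicted landmarks."""
    rng = rng or np.random.default_rng(1)
    T = landmarks.shape[0]
    # Hypothetical conditioning: per-frame landmark mean broadcast to an image grid.
    cond = np.broadcast_to(
        landmarks.mean(axis=(1, 2))[:, None, None], (T, *hw)
    )
    noise = rng.standard_normal((T, *hw))
    return denoise(noise, cond)

audio = np.linspace(0.0, 1.0, 8)[:, None]   # 8 timesteps, 1 audio feature each
lms = audio_to_frames = audio_to_landmarks(audio)
frames = landmarks_to_frames(lms)
print(lms.shape, frames.shape)
```

The design point the sketch captures is the factorization: audio only ever conditions the landmark diffusion, and the image diffusion sees landmarks alone, which is how the paper decouples temporal (audio-driven motion) from spatial (landmark-driven appearance) correspondence.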
Results and Findings
Extensive experiments on the HDTF and MEAD datasets show that DreamHead can effectively learn spatial-temporal consistency and produce high-fidelity audio-driven talking head videos. Compared to state-of-the-art methods, DreamHead achieves the best performance across various evaluation metrics, including mouth landmark distance, landmark velocity, intersection-over-union, and audio-video synchronization.
Implications and Conclusions
The proposed hierarchical diffusion framework, DreamHead, demonstrates the effectiveness of using facial landmarks as an intermediate representation to model the spatial-temporal correspondence for high-quality audio-driven talking head synthesis.
Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots
Authors: Hongming Zhang, Xiaoman Pan, Hongwei Wang, Kaixin Ma, Wenhao Yu, Dong Yu
Source and references: https://arxiv.org/abs/2409.10277v1
Introduction
This paper introduces Cognitive Kernel, an open-source agent system designed towards the goal of generalist autopilots. Unlike copilot systems that rely on user input, autopilot systems must complete tasks independently by actively acquiring state information from the environment.