Panoramic Diffusion, Multimodal Anomaly Detection, and Efficient Robotic Policies
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to today's edition of State of AI 🤖
👋 And a warm welcome to our 66 new subscribers since last edition!
This edition covers some fascinating research on generating coherent panoramic images using large language models, novel techniques for detecting anomalies in tabular data, and methods for deploying efficient robotic control policies on mobile devices. We'll also delve into advancements in multimodal models for Chinese hate speech detection and efficient document-level relation extraction.
Here's what caught our attention:
PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs - A novel framework that redefines panoramic image generation as a next-token prediction task, enabling endless and coherent panorama generation.
Diffusion-Scheduled Denoising Autoencoders for Anomaly Detection in Tabular Data - A framework that integrates diffusion-based noise scheduling and contrastive learning to enhance tabular data representation and anomaly detection performance (a rough illustrative sketch follows this list).
On-Device Diffusion Transformer Policy for Efficient Robot Manipulation - A method to accelerate diffusion-based robotic control policies for real-time deployment on mobile devices, achieving significant latency improvements without compromising performance.
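To make the anomaly-detection highlight a little more concrete, here is a minimal, self-contained sketch of the general idea: a denoising autoencoder for tabular rows whose corruption strength is sampled from a diffusion-style noise schedule, with reconstruction error used as the anomaly score. This illustrates the concept only, not the paper's implementation; the architecture, schedule, and names (`DenoisingAE`, `anomaly_score`) are assumptions, and the contrastive component is omitted.

```python
# Illustrative only: a tiny denoising autoencoder for tabular data whose
# corruption strength is drawn from a diffusion-style noise schedule.
# Anomalies are scored by reconstruction error. All names and
# hyperparameters are assumptions, not taken from the paper.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, dim, hidden=64, latent=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, latent))
        self.decoder = nn.Sequential(nn.Linear(latent, hidden), nn.ReLU(),
                                     nn.Linear(hidden, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_step(model, x_clean, optimizer, sigmas):
    # Sample a per-row noise level from the schedule, corrupt the row,
    # and train the model to recover the clean row.
    t = torch.randint(0, len(sigmas), (x_clean.size(0),))
    sigma = sigmas[t].unsqueeze(1)
    x_noisy = x_clean + sigma * torch.randn_like(x_clean)
    loss = ((model(x_noisy) - x_clean) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def anomaly_score(model, x):
    # Higher reconstruction error -> more anomalous.
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

sigmas = torch.linspace(0.01, 0.5, steps=50)   # toy linear noise schedule
model = DenoisingAE(dim=20)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_train = torch.randn(128, 20)                  # stand-in for "normal" rows
for _ in range(100):
    train_step(model, x_train, opt, sigmas)
scores = anomaly_score(model, torch.randn(8, 20))
```

Rows that the model reconstructs poorly relative to the training distribution receive high scores and are flagged as anomalous.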
Let's get into it 👇
Contents
Federated Cross-Training Learners for Robust Generalization under Data Heterogeneity
Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking
PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs
YOLO-Count: Differentiable Object Counting for Text-to-Image Generation
Adacc: Adaptive Compression and Activation Checkpointing for LLM Memory Management
Towards Fair In-Context Learning with Tabular Foundation Models
Diffusion-Scheduled Denoising Autoencoders for Anomaly Detection in Tabular Data
Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models
GLiDRE: Generalist Lightweight model for Document-level Relation Extraction
On-Device Diffusion Transformer Policy for Efficient Robot Manipulation
Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics
Authors: Tom Or, Omri Azencot
Source and references: https://arxiv.org/abs/2508.00784v1
Introduction
This paper proposes a novel approach for detecting synthetic content across multiple modalities, including images and audio, by leveraging the latent representations of large pre-trained multi-modal models.
Key Points
The paper extends the recent paradigm of using CLIP-ViT features for deepfake detection to the multi-modal setting, providing an in-depth analysis of such models.
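As a rough illustration of this feature-probing paradigm (not the authors' pipeline), the sketch below extracts an intermediate-layer CLS activation from a pre-trained CLIP-ViT image encoder and fits a simple linear probe to separate real from synthetic images. The checkpoint name, layer index, and file paths are placeholder assumptions.

```python
# Minimal sketch (not the paper's pipeline): probe an intermediate layer of a
# pre-trained CLIP-ViT image encoder with a linear classifier to separate
# real from synthetic images. Checkpoint, layer index, and paths are
# illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel
from sklearn.linear_model import LogisticRegression

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").eval()

def layer_features(paths, layer=8):
    """Return the CLS-token activation of a chosen hidden layer for each image."""
    feats = []
    for p in paths:
        inputs = processor(images=Image.open(p).convert("RGB"), return_tensors="pt")
        with torch.no_grad():
            out = encoder(**inputs, output_hidden_states=True)
        feats.append(out.hidden_states[layer][0, 0])  # CLS token at that layer
    return torch.stack(feats).numpy()

# Placeholder file lists standing in for labeled real/synthetic images.
real_paths, fake_paths = ["real_0.png"], ["fake_0.png"]
X = layer_features(real_paths + fake_paths)
y = [0] * len(real_paths) + [1] * len(fake_paths)
probe = LogisticRegression(max_iter=1000).fit(X, y)   # simple linear probe
```

In the spirit of the paper's multi-modal extension, the same recipe carries over to other modalities such as audio by swapping in the corresponding pre-trained encoder.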