State of AI

Panoramic Diffusion, Multimodal Anomaly Detection, and Efficient Robotic Policies

Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Aug 05, 2025

Welcome to today's edition of State of AI 🤖

👋 And a warm welcome to our 66 new subscribers since last edition!

This edition covers research on generating coherent panoramic images with large language models, detecting anomalies in tabular data, and deploying efficient robotic control policies on mobile devices. We'll also look at multimodal models for Chinese hate speech detection and a lightweight model for document-level relation extraction.

Here's what caught our attention:

  • PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs - A novel framework that redefines panoramic image generation as a next-token prediction task, enabling endless and coherent panorama generation.

  • Diffusion-Scheduled Denoising Autoencoders for Anomaly Detection in Tabular Data - A framework that integrates diffusion-based noise scheduling and contrastive learning to enhance tabular data representation and anomaly detection performance.

  • On-Device Diffusion Transformer Policy for Efficient Robot Manipulation - A method to accelerate diffusion-based robotic control policies for real-time deployment on mobile devices, achieving significant latency improvements without compromising performance.

Let's get into it 👇

Contents

  1. Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics

  2. Federated Cross-Training Learners for Robust Generalization under Data Heterogeneity

  3. Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking

  4. LLaVA-Video: Video Instruction Tuning With Synthetic Data

  5. PanoLlama: Generating Endless and Coherent Panoramas with Next-Token-Prediction LLMs

  6. YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

  7. Adacc: Adaptive Compression and Activation Checkpointing for LLM Memory Management

  8. Towards Fair In-Context Learning with Tabular Foundation Models

  9. Diffusion-Scheduled Denoising Autoencoders for Anomaly Detection in Tabular Data

  10. Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models

  11. MMBERT: Scaled Mixture-of-Experts Multimodal BERT for Robust Chinese Hate Speech Detection under Cloaking Perturbations

  12. GLiDRE: Generalist Lightweight model for Document-level Relation Extraction

  13. Video Generators are Robot Policies

  14. On-Device Diffusion Transformer Policy for Efficient Robot Manipulation

  15. FalconGym: A Photorealistic Simulation Framework for Zero-Shot Sim-to-Real Vision-Based Quadrotor Navigation

Unraveling Hidden Representations: A Multi-Modal Layer Analysis for Better Synthetic Content Forensics

Authors: Tom Or, Omri Azencot

Source and references: https://arxiv.org/abs/2508.00784v1


Introduction

This paper proposes a novel approach for detecting synthetic content across multiple modalities, including images and audio, by leveraging the latent representations of large pre-trained multi-modal models.

Key Points

  • The paper extends the recent paradigm of using CLIP-ViT features for deepfake detection to the multi-modal setting, providing an in-depth analysis of such models.
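For readers new to the idea of detecting synthetic content from frozen features, here is a minimal sketch of a linear probe: a logistic regression trained on top of fixed feature vectors. It is an illustration only and not the paper's method. The feature vectors are random toy data standing in for embeddings that would normally come from a pre-trained model such as CLIP-ViT, and the names `train_linear_probe`, `predict`, and `make_toy_features` are hypothetical.

```python
import math
import random


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic regression (linear probe) on frozen feature vectors."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        # Full-batch gradient descent step on the mean log-loss
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (synthetic) or 0 (real) for a single feature vector."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0


def make_toy_features(n, dim, shift, rng):
    """ASSUMPTION: toy Gaussian vectors standing in for real model embeddings."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
real = make_toy_features(50, 8, -1.0, rng)  # label 0: "real" content
fake = make_toy_features(50, 8, 1.0, rng)   # label 1: "synthetic" content
features = real + fake
labels = [0] * 50 + [1] * 50

w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(features, labels)) / len(labels)
print(f"training accuracy on toy features: {accuracy:.2f}")
```

In a real pipeline the toy vectors would be replaced by features extracted from a chosen layer of the pre-trained model, with the backbone kept frozen and only the probe trained.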
