Dynamic Routing for LLMs, Unified Embodied Intelligence, and Real-Time Agentic Reasoning

May 15, 2026

Welcome to today’s edition of State of AI 👋

This week brought a flurry of breakthroughs in making AI systems more efficient, more unified, and more real-time. We’re seeing a shift from monolithic approaches toward modular architectures that know when to be expensive, whether that’s routing computation dynamically between precision levels, combining specialized vision experts for robotics, or launching tool calls speculatively while maintaining correctness. Alongside this comes a wave of work on truly unified systems: embodied models that jointly learn understanding, reasoning, imagination and action from a single loop; VLAs for autonomous driving that preserve language understanding without sacrificing control precision; and world models that can generate minute-scale video with precise camera control at accessible compute budgets. The common thread? Strategic coupling of once-separate components, powered by careful system design rather than brute-force scaling.

Here’s what caught our attention:

Dynamic Mixed-Precision Routing for Efficient Multi-step LLM Interaction — Step-level routing between full and quantized models achieves 1.01-1.58× speedups while recovering task success rates, using a lightweight router that identifies when quantization fails mid-trajectory.
OpenDeepThink: Parallel Reasoning via Bradley–Terry Aggregation — Pairwise LLM comparison reaches 86% accuracy versus 59% pointwise scoring for selecting among reasoning candidates, enabling parallel test-time compute without trained verifiers.
SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer — A 2.6B-parameter model generates minute-long 720p videos with 6-DoF camera control at 24.1 videos/hour on a single H100, combining frame-wise linear attention with periodic softmax attention for long-context stability.
Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation — Replaces expensive trajectory precomputation with online single-step ODE supervision, achieving a 4× training speedup (11.6k→2.9k GPU-hours), with frame-wise 2-step video generation running at 50% lower latency than prior 4-step methods.
Pelican-Unified 1.0: A Unified Embodied Intelligence Model — A single model ranks #1 on WorldArena (66.03 EWM), achieves 93.5% on RoboTwin, and outperforms experienced human drivers on Waymo, suggesting that understanding, reasoning, imagination, and action can co-evolve through shared training rather than modular assembly.
MindVLA-U1: VLA Beats VA with Unified Streaming Architecture — The first VLA to surpass experienced human drivers on WOD-E2E (8.20 vs 8.13 RFS) while preserving a natural-language interface through streaming memory and intent-conditioned action diffusion.
Speculative Interaction Agents: Building Real-Time Agents with Asynchronous I/O — Decouples agent reasoning from I/O delays via event-driven architecture and speculative tool calling, achieving 1.6-2.2× speedups on edge models while maintaining accuracy through read/write classification.
VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing — Distills DINOv2/CLIP/ViT into a mixture-of-experts with lightweight routing, reaching 74.7% average success across 17 manipulation tasks while suppressing task-irrelevant background information.

Let’s get into it 👇

Bi-Weekly AI Research Roundup

Latest research summaries in ML, Robotics, CV, NLP and AI

Dynamic Mixed-Precision Routing for Efficient Multi-step LLM Interaction

Authors: Yuanzhe Li, Jianing Deng, Jingtong Hu, Tianlong Chen, Song Wang, Huanrui Yang

Source and references: https://arxiv.org/abs/2602.02711v2

Introduction

This paper addresses the computational cost of deploying large language models in long-horizon decision-making tasks by proposing Dynamic Mixed-Precision Routing (DMR), a framework that selectively routes computation between full-precision and quantized models at each decision step. The key insight is that different steps in multi-step reasoning have varying sensitivity to quantization, allowing selective use of expensive full-precision computation only where necessary.
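To make the step-level idea concrete, here is a minimal sketch of a routing loop. It is not the paper's implementation: `run_episode`, `StepRecord`, the 0.5 threshold, and the toy router and models are all hypothetical stand-ins, and a real agent would produce each observation from the previous action rather than read it from a fixed list.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class StepRecord:
    step: int
    used_full_precision: bool
    action: str


def run_episode(
    observations: Sequence[str],
    router: Callable[[str], float],
    quantized_model: Callable[[str], str],
    full_model: Callable[[str], str],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> List[StepRecord]:
    """Pick one of two models at every decision step.

    router(obs) is assumed to return a score in [0, 1] estimating how
    likely the quantized model is to go wrong at this step.
    """
    trace = []
    for i, obs in enumerate(observations):
        use_full = router(obs) >= threshold
        model = full_model if use_full else quantized_model
        trace.append(StepRecord(i, use_full, model(obs)))
    return trace


# Toy stand-ins: the "router" flags only the checkout step as risky.
toy_router = lambda obs: 1.0 if "checkout" in obs else 0.0
toy_quantized = lambda obs: f"q:{obs}"
toy_full = lambda obs: f"fp:{obs}"

trace = run_episode(
    ["open page", "search item", "checkout form"],
    toy_router, toy_quantized, toy_full,
)
for record in trace:
    print(record)
```

In this toy run only the last step is sent to the full-precision model, which is the behavior the framework aims for: cheap inference by default, expensive inference only where the router expects quantization to fail.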

Key Points

Step-level routing framework: DMR operates at the decision-step level rather than the coarser question level or finer token level, enabling fine-grained control suited for agentic tasks like web navigation and embodied reasoning.
Two-stage training pipeline: The approach combines KL-divergence-based supervised learning to identify precision-sensitive steps with reinforcement learning (GRPO) refinement to optimize task success under cost constraints.
Lightweight router design: The routing model comprises only 2-3% of the routed LLM's parameters, keeping routing overhead low while remaining capable of identifying the critical decision points where quantization fails.
Significant accuracy-cost trade-offs: DMR matches or exceeds full-precision performance while achieving 1.01-1.58× speedups over full-precision baselines, with the trade-off controlled via a budget parameter ρ.
Robust bimodal behavior: The framework exploits the observation that KL divergence between low- and high-precision models exhibits a clear two-regime structure: most steps show minimal deviation, while critical steps show substantial deviation.
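The KL-based labeling described in the points above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: `kl_divergence` and `label_sensitive_steps` are hypothetical helpers, the per-step distributions are made-up toy values, and ρ is interpreted here as the maximum fraction of steps allowed to use full precision, which the summary does not confirm.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def label_sensitive_steps(
    full_dists: Sequence[Sequence[float]],
    quant_dists: Sequence[Sequence[float]],
    rho: float,
) -> Tuple[List[int], List[float]]:
    """Label the ceil(rho * n) steps with the largest KL as sensitive (1).

    Assumption: rho is the fraction of steps that may use full precision.
    """
    kls = [kl_divergence(p, q) for p, q in zip(full_dists, quant_dists)]
    budget = math.ceil(rho * len(kls))
    ranked = sorted(range(len(kls)), key=lambda i: kls[i], reverse=True)
    chosen = set(ranked[:budget])
    labels = [1 if i in chosen else 0 for i in range(len(kls))]
    return labels, kls


# Toy per-step action distributions from the full and quantized models.
full_dists = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2], [0.1, 0.9]]
quant_dists = [[0.88, 0.12], [0.52, 0.48], [0.2, 0.8], [0.1, 0.9]]

labels, kls = label_sensitive_steps(full_dists, quant_dists, rho=0.25)
print(labels)
print([round(k, 4) for k in kls])
```

In the toy data, three steps show near-zero divergence and one step shows a large one, mirroring the two-regime structure. Labels of this kind could serve as supervised targets for a router before any reinforcement learning refinement.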

Methodology
