Visual Tokens, Strategic Reasoning, Protein Design, and Cloud Debugging with LLMs
Latest research summaries in ML, Robotics, CV, NLP and AI
👋 Welcome to this week’s edition of State of AI, and a big hello to our 647 new subscribers since the last edition!
This edition is stacked with research that rethinks the boundaries of AI systems, from how models reason about chemical reactions and strategy games to how they navigate human-filled environments and control physical robots. We’re seeing work that pushes into long-standing frontiers: hallucination in vision-language models, real-time protein design, memory systems in AI agents, and hardware-efficient decoding at scale. It’s a good week to be curious.
Some highlights:
ChemCoTBench raises the bar for LLMs in chemistry, trading shallow QA for modular reasoning grounded in real-world tasks.
TMGBench tests whether LLMs can actually think strategically across 2x2 game spaces—and where they still fall short.
Selftok proposes a unification of autoregressive and diffusion models through discrete visual tokens, with stunning performance boosts in vision tasks.
Dash, a low-code framework for cloud debugging with multi-modal RAG, hints at a future where LLMs help debug production systems in real time.
Hume and EquAct bring System-2 reasoning and SE(3)-equivariance to robotics, enabling smarter manipulation and planning.
And don’t miss the papers on attention acceleration, protein generation, hallucination mitigation, and the evolving taxonomy of AI memory systems.
Let’s dig in 👇
Contents
Diagnosing and Resolving Cloud Platform Instability with Multi-modal RAG LLMs
Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations
TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs
Selftok: Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration
Annealing Flow Generative Models Towards Sampling High-Dimensional and Multi-Modal Distributions
Designing Cyclic Peptides via Harmonic SDE with Atom-Bond Modeling
Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion
Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
EquAct: An SE(3)-Equivariant Multi-Task Transformer for Open-Loop Robotic Manipulation
Hume: Introducing System-2 Thinking in Visual-Language-Action Model
Diagnosing and Resolving Cloud Platform Instability with Multi-modal RAG LLMs
Authors: Yifan Wang, Kenneth P. Birman
Source and references: https://arxiv.org/abs/2505.21419v1
Introduction
This paper introduces Dash, a low-code development platform for creating AI applications in industrial settings, such as product inspection on factory shop floors.
Key Points
Dash aims to address three main challenges in low-code AI development: a lack of composable tools for AI experts, the difficulty of customizing AI models for deployment environments, and the need for better runtime support and troubleshooting.
Dash is designed to work with a distributed edge computing infrastructure, using Cascade as the underlying data and compute hosting framework.
Dash provides model recommendation and type checking features to assist AI experts in selecting and composing the right AI components (a rough sketch of what such a composition flow might look like follows after this list).
Dash seeks to simplify the deployment process for non-expert deployment specialists by automating many customization tasks.
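To make the "composable components plus type checking" idea concrete, here is a minimal, hypothetical sketch of what a recommendation-and-composition flow could look like. None of these names (`Component`, `recommend`, `type_check`, the toy catalog) come from the paper; Dash's actual API and its Cascade integration will differ.

```python
# Hypothetical sketch only: the names below are illustrative, not Dash's real API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Component:
    name: str
    input_type: str   # e.g. "image", "bounding_boxes", "text"
    output_type: str

# A toy "model catalog" the platform could recommend components from.
CATALOG = [
    Component("defect_detector", input_type="image", output_type="bounding_boxes"),
    Component("ocr_reader", input_type="image", output_type="text"),
    Component("report_writer", input_type="bounding_boxes", output_type="text"),
]

def recommend(required_output: str) -> list[Component]:
    """Suggest catalog components whose output type matches what the user asked for."""
    return [c for c in CATALOG if c.output_type == required_output]

def type_check(pipeline: list[Component]) -> None:
    """Reject pipelines where one stage's output type doesn't match the next stage's input."""
    for upstream, downstream in zip(pipeline, pipeline[1:]):
        if upstream.output_type != downstream.input_type:
            raise TypeError(
                f"{upstream.name} produces '{upstream.output_type}' but "
                f"{downstream.name} expects '{downstream.input_type}'"
            )

if __name__ == "__main__":
    # Compose an inspection pipeline: detect defects on an image, then write a text report.
    pipeline = [CATALOG[0], CATALOG[2]]
    type_check(pipeline)                        # passes: bounding_boxes feeds bounding_boxes
    print([c.name for c in recommend("text")])  # ['ocr_reader', 'report_writer']
```

The point of the sketch is the workflow, not the specifics: the platform suggests candidate components, and a static type check catches incompatible compositions before anything is deployed to the edge infrastructure.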