Temporal Understanding and Efficient Text Generation
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to today’s edition of State of AI 👋 And a warm welcome to our 122 new subscribers since the last edition!
This edition explores the latest advances in multimodal reasoning, the ability of large language models to understand and reason about temporal information in medical and visual data, and novel approaches to efficient, high-fidelity text generation from limited resources.
Here’s what caught our attention:
Visual Serial Processing Deficits Explain Divergences in Human and VLM Reasoning: Investigating why vision language models (VLMs) often fail to match human performance on seemingly simple visual reasoning tasks, despite their success on standard benchmarks.
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models: Introducing the first benchmark designed to evaluate the ability of large vision-language models to reason over temporal medical images and track changes in patients’ conditions.
Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation: Presenting a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data, enabling training of natural language generation models in low-resource scenarios.
Let’s get into it 👇
Contents
Visual serial processing deficits explain divergences in human and VLM reasoning
Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models
Scaling Spoken Language Models with Syllabic Speech Tokenization
TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
Branching Out: Broadening AI Measurement and Evaluation with Measurement Trees
AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond
Visual serial processing deficits explain divergences in human and VLM reasoning
Authors: Nicholas Budny, Kia Ghods, Declan Campbell, Raja Marjieh, Amogh Joshi, Sreejan Kumar, Jonathan D. Cohen, Taylor W. Webb, Thomas L. Griffiths
Source and references: https://arxiv.org/abs/2509.25142v1
Introduction
This paper investigates why Vision Language Models (VLMs) often fail to match human performance on seemingly simple visual reasoning tasks, despite their success on standard benchmarks. The authors hypothesize that a key factor is a deficit in visually-grounded serial processing.
Key Points
The authors use human reaction time as a proxy for serial processing load to evaluate VLM performance across three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation (a toy version of this proxy analysis is sketched after this list).
Across all domains, the VLM-human performance gap widens as tasks require more demanding serial processing, whether composing concepts, enumerating items, or performing mental transformations.
The authors provide causal evidence for the serial processing deficit by showing that augmenting VLMs with forms of serial processing, such as chain-of-thought and tool use, can improve performance on specific tasks.
The results suggest that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.
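To make the "reaction time as a proxy" idea concrete, here is a minimal sketch, not taken from the paper, of how one might correlate per-condition human reaction times with VLM accuracy. The condition names, numbers, and the choice of Spearman rank correlation are illustrative assumptions rather than the authors' actual analysis.

```python
# Hypothetical sketch: correlate human reaction time (a proxy for serial
# processing load) with VLM accuracy across task conditions.
# All condition names and numbers below are made up for illustration.
import numpy as np
from scipy.stats import spearmanr

conditions   = ["1 item", "3 items", "5 items", "7 items"]   # e.g. enumeration set sizes
human_rt_ms  = np.array([450.0, 620.0, 810.0, 1020.0])       # mean human reaction time per condition
vlm_accuracy = np.array([0.98, 0.91, 0.74, 0.55])            # mean VLM accuracy per condition

# If the serial-processing-deficit hypothesis holds, VLM accuracy should
# fall as human reaction time (processing load) rises, giving a strongly
# negative rank correlation.
rho, p_value = spearmanr(human_rt_ms, vlm_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

for cond, rt, acc in zip(conditions, human_rt_ms, vlm_accuracy):
    print(f"{cond:>8}: human RT {rt:6.0f} ms | VLM accuracy {acc:.2f}")
```

With monotonic toy data the correlation is trivially −1; the paper's claim is that this widening human-VLM gap shows up with real data across geometric reasoning, perceptual enumeration, and mental rotation.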
Methodology