Temporal Understanding and Efficient Text Generation
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to today’s edition of State of AI 👋 And a warm welcome to our 122 new subscribers since the last edition!
This edition explores the latest advances in multimodal reasoning, the ability of large language models to understand and reason about temporal information in medical and visual data, and novel approaches to efficient, high-fidelity text generation from limited resources.
Here’s what caught our attention:
Visual Serial Processing Deficits Explain Divergences in Human and VLM Reasoning: Investigating why vision language models (VLMs) often fail to match human performance on seemingly simple visual reasoning tasks, despite their success on standard benchmarks.
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models: Introducing the first benchmark designed to evaluate the ability of large vision-language models to reason over temporal medical images and track changes in patients’ conditions.
Paired by the Teacher: Turning Unpaired Data into High-Fidelity Pairs for Low-Resource Text Generation: Presenting a two-stage teacher-student pipeline that synthesizes accurate input-output pairs without human labels or parallel data, enabling training of natural language generation models in low-resource scenarios.
Let’s get into it 👇
Contents
Visual serial processing deficits explain divergences in human and VLM reasoning
Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs
TemMed-Bench: Evaluating Temporal Medical Image Reasoning in Vision-Language Models
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline
Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models
Scaling Spoken Language Models with Syllabic Speech Tokenization
TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance
Branching Out: Broadening AI Measurement and Evaluation with Measurement Trees
AccidentBench: Benchmarking Multimodal Understanding and Reasoning in Vehicle Accidents and Beyond
Visual serial processing deficits explain divergences in human and VLM reasoning
Authors: Nicholas Budny, Kia Ghods, Declan Campbell, Raja Marjieh, Amogh Joshi, Sreejan Kumar, Jonathan D. Cohen, Taylor W. Webb, Thomas L. Griffiths
Source and references: https://arxiv.org/abs/2509.25142v1
Introduction
This paper investigates why Vision Language Models (VLMs) often fail to match human performance on seemingly simple visual reasoning tasks, despite their success on standard benchmarks. The authors hypothesize that a key factor is a deficit in visually-grounded serial processing.
Key Points
The authors use human reaction time as a proxy for serial processing load to evaluate VLM performance across three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation (a toy version of this proxy analysis is sketched after this list).
Across all domains, the VLM-human performance gap widens as tasks require more demanding serial processing, whether composing concepts, enumerating items, or performing mental transformations.
The authors provide causal evidence for the serial processing deficit by showing that augmenting VLMs with forms of serial processing, such as chain-of-thought and tool use, can improve performance on specific tasks.
The results suggest that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.
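To make the "reaction time as a proxy" idea concrete, here is a minimal sketch, not taken from the paper, of how one might correlate per-condition human reaction times with VLM accuracy. The condition names, numbers, and the choice of Spearman rank correlation are illustrative assumptions rather than the authors' actual analysis.

```python
# Hypothetical sketch: correlate human reaction time (a proxy for serial
# processing load) with VLM accuracy across task conditions.
# All condition names and numbers below are made up for illustration.
import numpy as np
from scipy.stats import spearmanr

conditions   = ["1 item", "3 items", "5 items", "7 items"]   # e.g. enumeration set sizes
human_rt_ms  = np.array([450.0, 620.0, 810.0, 1020.0])       # mean human reaction time per condition
vlm_accuracy = np.array([0.98, 0.91, 0.74, 0.55])            # mean VLM accuracy per condition

# If the serial-processing-deficit hypothesis holds, VLM accuracy should
# fall as human reaction time (processing load) rises, giving a strongly
# negative rank correlation.
rho, p_value = spearmanr(human_rt_ms, vlm_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

for cond, rt, acc in zip(conditions, human_rt_ms, vlm_accuracy):
    print(f"{cond:>8}: human RT {rt:6.0f} ms | VLM accuracy {acc:.2f}")
```

With monotonic toy data the correlation is trivially −1; the paper's claim is that this widening human-VLM gap shows up with real data across geometric reasoning, perceptual enumeration, and mental rotation.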
Methodology