State of AI

LLMs Meet the Real World: Robustness in Tabular Data, Circuit Diagrams, and 3D Spatial Reasoning

Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Jun 03, 2025

Welcome to Today’s Edition of State of AI
👋 And a warm welcome to our 443 new subscribers since the last edition!

Today’s lineup explores how far we’ve come in making LLMs reason, remember, and interact not just with text, but with tables, tools, timelines, and the physical world. From memory-augmented factual generation to curiosity-driven AI scientists navigating complex ecosystems, these papers cover both cutting-edge theory and gritty practical systems.

Here’s what caught our attention:

  • When LLMs Play the Telephone Game shows how information mutates across multi-turn model interactions and why cultural attractors matter.

  • Schemato tackles netlist-to-schematic conversion with an LLM fine-tuned for circuit design.

  • Ewe adds working memory to boost factual accuracy without sacrificing helpfulness.

  • SpatialLLM raises the bar for 3D spatial reasoning in multimodal models.

  • Safety at Scale surveys the threat landscape for large models and where defenses still fall short.

  • And Tiny Transformers on embedded FPGAs show that low-latency, energy-efficient time-series analysis is achievable at the edge.

Whether you care about robustness in tabular reasoning, optimizing multimodal retrieval, or decoding the anatomy of diffusion-based planning for autonomous vehicles, there’s something here for you.

Let’s jump in 👇

Contents

  1. How well do LLMs reason over tabular data, really?

  2. Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics

  3. When LLMs Play the Telephone Game: Cultural Attractors as Conceptual Tools to Evaluate LLMs in Multi-turn Settings

  4. Safety at Scale: A Comprehensive Survey of Large Model Safety

  5. OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

  6. SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

  7. Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

  8. Schemato - An LLM for Netlist-to-Schematic Conversion

  9. Automating Versatile Time-Series Analysis with Tiny Transformers on Embedded FPGAs

  10. Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

  11. Improving Factuality with Explicit Working Memory

  12. Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric

  13. Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

  14. Hume: Introducing System-2 Thinking in Visual-Language-Action Model

  15. DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving

How well do LLMs reason over tabular data, really?

Authors: Cornelius Wolff, Madelon Hulsebos

Source and references: https://arxiv.org/abs/2505.07453v2


Introduction

This paper examines the reasoning capabilities of large language models (LLMs) over tabular data, and investigates whether they are robust to realistic variations in tabular inputs.

Key Points

  • The paper surfaces limitations of common free-form text evaluation metrics such as SacreBLEU and BERTScore for assessing tabular reasoning capabilities.

  • It proposes an LLM-as-a-judge approach as a more reliable evaluation method, and finds significantly lower tabular reasoning accuracy for LLMs than previous studies reported.

  • The paper formalizes three common characteristics of tabular data - missing values, duplicate entities, and structural variations - and evaluates the robustness of LLMs to these variations.

  • Experiments show that LLMs struggle with these realistic characteristics of tabular inputs, highlighting the need for more robust tabular reasoning; a sketch of such perturbations follows this list.
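
To make these perturbations concrete, here is a minimal sketch of how such variations could be injected into a table for robustness testing. It assumes tables are held as pandas DataFrames; the function names, sampling fractions, and perturbation details are illustrative and not taken from the paper.

```python
# Illustrative perturbations for robustness testing of tabular reasoning.
# Assumes tables are pandas DataFrames; not the paper's actual code.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def add_missing_values(df: pd.DataFrame, frac: float = 0.1) -> pd.DataFrame:
    """Blank out a random fraction of cells to simulate missing values."""
    mask = rng.random(df.shape) < frac
    return df.mask(mask)

def add_duplicate_entities(df: pd.DataFrame, frac: float = 0.1) -> pd.DataFrame:
    """Re-append a random sample of rows to simulate duplicate entities."""
    duplicates = df.sample(frac=frac, random_state=0)
    return pd.concat([df, duplicates], ignore_index=True)

def vary_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Shuffle column order as one simple form of structural variation."""
    columns = list(df.columns)
    rng.shuffle(columns)
    return df[columns]
```

Each perturbed table can then be passed through the same question-answering prompt as the clean table, so any drop in accuracy can be attributed to the perturbation.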

Methodology

The paper leverages the TQA-Bench benchmark for tabular reasoning tasks, but revises the queries to improve their validity and downscales the tabular inputs to gain more granular insights. It then uses an LLM-as-a-judge approach to evaluate the LLM-generated answers reliably.
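
As a rough illustration of the LLM-as-a-judge setup, the sketch below grades each model answer against a reference answer with a short grading prompt. The prompt wording and the `call_llm` helper are hypothetical stand-ins for whatever judge model and API are used; they are not taken from the paper.

```python
# Illustrative LLM-as-a-judge check; `call_llm` is a hypothetical helper
# that sends a prompt string to a chat model and returns its text reply.

JUDGE_PROMPT = """You are grading an answer to a question about a table.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(call_llm, question: str, reference: str, candidate: str) -> bool:
    """Return True if the judge model deems the candidate answer correct."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return verdict.strip().upper().startswith("CORRECT")
```

Averaging `judge_answer` over a benchmark gives an accuracy estimate that is less sensitive to surface wording than n-gram metrics such as SacreBLEU.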

Results and Findings

Keep reading with a 7-day free trial
