State of AI

LLMs Meet the Real World: Robustness in Tabular Data, Circuit Diagrams, and 3D Spatial Reasoning

Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Jun 03, 2025

Welcome to Today’s Edition of State of AI
👋 And a warm welcome to our 443 new subscribers since the last edition!

Today’s lineup explores how far we’ve come in making LLMs reason, remember, and interact not just with text, but with tables, tools, timelines, and the physical world. From memory-augmented factual generation to curiosity-driven AI scientists navigating complex ecosystems, these papers cover both cutting-edge theory and gritty practical systems.

Here’s what caught our attention:

  • When LLMs Play the Telephone Game shows how information mutates across multi-turn model interactions and why cultural attractors matter.

  • Schemato tackles netlist-to-schematic conversion with an LLM fine-tuned for circuit design.

  • Ewe adds working memory to boost factual accuracy without sacrificing helpfulness.

  • SpatialLLM raises the bar for 3D spatial reasoning in multimodal models.

  • Safety at Scale surveys the threat landscape for large models and where defenses still fall short.

  • And Tiny Transformers on embedded FPGAs show that low-latency, energy-efficient time-series analysis is achievable at the edge.

Whether you care about robustness in tabular reasoning, optimizing multimodal retrieval, or decoding the anatomy of diffusion-based planning for autonomous vehicles, there’s something here for you.

Let’s jump in 👇

Contents

  1. How well do LLMs reason over tabular data, really?

  2. Exploring Flow-Lenia Universes with a Curiosity-driven AI Scientist: Discovering Diverse Ecosystem Dynamics

  3. When LLMs Play the Telephone Game: Cultural Attractors as Conceptual Tools to Evaluate LLMs in Multi-turn Settings

  4. Safety at Scale: A Comprehensive Survey of Large Model Safety

  5. OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

  6. SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

  7. Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

  8. Schemato - An LLM for Netlist-to-Schematic Conversion

  9. Automating Versatile Time-Series Analysis with Tiny Transformers on Embedded FPGAs

  10. Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

  11. Improving Factuality with Explicit Working Memory

  12. Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric

  13. Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

  14. Hume: Introducing System-2 Thinking in Visual-Language-Action Model

  15. DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving

How well do LLMs reason over tabular data, really?

Authors: Cornelius Wolff, Madelon Hulsebos

Source and references: https://arxiv.org/abs/2505.07453v2


Introduction

This paper examines the reasoning capabilities of large language models (LLMs) over tabular data, and investigates whether they are robust to realistic variations in tabular inputs.

Key Points

  • The paper surfaces limitations of common free-form text evaluation metrics such as SacreBLEU and BERTScore for assessing tabular reasoning capabilities.

  • It proposes an LLM-as-a-judge approach as a more reliable evaluation method, and finds significantly lower tabular reasoning accuracy for LLMs than previous studies reported.

  • The paper formalizes three common characteristics of tabular data - missing values, duplicate entities, and structural variations - and evaluates the robustness of LLMs to these variations.

  • Experiments show that LLMs struggle with these realistic characteristics of tabular inputs, highlighting the need for more robust tabular reasoning; a sketch of such perturbations follows this list.
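
To make these perturbations concrete, here is a minimal sketch of how such variations could be injected into a table for robustness testing. It assumes tables are held as pandas DataFrames; the function names, sampling fractions, and perturbation details are illustrative and not taken from the paper.

```python
# Illustrative perturbations for robustness testing of tabular reasoning.
# Assumes tables are pandas DataFrames; not the paper's actual code.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def add_missing_values(df: pd.DataFrame, frac: float = 0.1) -> pd.DataFrame:
    """Blank out a random fraction of cells to simulate missing values."""
    mask = rng.random(df.shape) < frac
    return df.mask(mask)

def add_duplicate_entities(df: pd.DataFrame, frac: float = 0.1) -> pd.DataFrame:
    """Re-append a random sample of rows to simulate duplicate entities."""
    duplicates = df.sample(frac=frac, random_state=0)
    return pd.concat([df, duplicates], ignore_index=True)

def vary_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Shuffle column order as one simple form of structural variation."""
    columns = list(df.columns)
    rng.shuffle(columns)
    return df[columns]
```

Each perturbed table can then be passed through the same question-answering prompt as the clean table, so any drop in accuracy can be attributed to the perturbation.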

Methodology

The paper leverages the TQA-Bench benchmark for tabular reasoning tasks, but revises the queries to improve their validity and downscales the tabular inputs to gain more granular insights. It then uses an LLM-as-a-judge approach to evaluate the LLM-generated answers reliably.
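
As a rough illustration of the LLM-as-a-judge setup, the sketch below grades each model answer against a reference answer with a short grading prompt. The prompt wording and the `call_llm` helper are hypothetical stand-ins for whatever judge model and API are used; they are not taken from the paper.

```python
# Illustrative LLM-as-a-judge check; `call_llm` is a hypothetical helper
# that sends a prompt string to a chat model and returns its text reply.

JUDGE_PROMPT = """You are grading an answer to a question about a table.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_answer(call_llm, question: str, reference: str, candidate: str) -> bool:
    """Return True if the judge model deems the candidate answer correct."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return verdict.strip().upper().startswith("CORRECT")
```

Averaging `judge_answer` over a benchmark gives an accuracy estimate that is less sensitive to surface wording than n-gram metrics such as SacreBLEU.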

Results and Findings

Keep reading with a 7-day free trial
