LLMs Meet the Real World: Robustness in Tabular Data, Circuit Diagrams, and 3D Spatial Reasoning
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to Today’s Edition of State of AI
👋 And a warm welcome to our 443 new subscribers since the last edition!
Today’s lineup explores how far we’ve come in making LLMs reason, remember, and interact not just with text, but with tables, tools, timelines, and the physical world. From memory-augmented factual generation to curiosity-driven AI scientists navigating complex ecosystems, these papers cover both cutting-edge theory and gritty practical systems.
Here’s what caught our attention:
LLMs playing Telephone shows how information mutates across multi-turn model interactions and why cultural attractors matter.
Schemato tackles netlist-to-schematic conversion with an LLM fine-tuned for circuit design.
Ewe adds working memory to boost factual accuracy without sacrificing helpfulness.
SpatialLLM raises the bar for 3D spatial reasoning in multimodal models.
Safety at Scale surveys every known threat to large models and where defenses still fall short.
And Tiny Transformers on embedded FPGAs prove you can get low-latency, energy-efficient time-series analysis at the edge.
Whether you care about robustness in tabular reasoning, optimizing multimodal retrieval, or decoding the anatomy of diffusion-based planning for autonomous vehicles, there’s something here for you.
Let’s jump in 👇
Contents
Safety at Scale: A Comprehensive Survey of Large Model Safety
OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Efficient Generative Modeling with Residual Vector Quantization-Based Tokens
Automating Versatile Time-Series Analysis with Tiny Transformers on Embedded FPGAs
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric
Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
Hume: Introducing System-2 Thinking in Visual-Language-Action Model
DiffVLA: Vision-Language Guided Diffusion Planning for Autonomous Driving
How well do LLMs reason over tabular data, really?
Authors: Cornelius Wolff, Madelon Hulsebos
Source and references: https://arxiv.org/abs/2505.07453v2
Introduction
This paper examines the reasoning capabilities of large language models (LLMs) over tabular data, and investigates whether they are robust to realistic variations in tabular inputs.
Key Points
The paper surfaces limitations of common free-form text evaluation metrics such as SacreBLEU and BERTScore for assessing tabular reasoning capabilities.
It proposes an LLM-as-a-judge approach as a more reliable evaluation method, and finds that the tabular reasoning accuracy of LLMs is significantly lower than previous studies suggested.
The paper formalizes three common characteristics of real-world tabular data - missing values, duplicate entities, and structural variations - and evaluates the robustness of LLMs to each of them.
Experiments show that LLMs struggle with these realistic characteristics of tabular inputs, highlighting the need for more robust models for tabular reasoning.
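To make the three perturbations concrete, here is a minimal sketch of how such table variations could be generated. The function names and the toy table are our own illustration, not the paper's code; the paper's actual perturbation procedures may differ.

```python
import random

def drop_values(rows, col, frac, seed=0):
    """Blank out a fraction of values in one column (missing values)."""
    rng = random.Random(seed)
    out = [dict(r) for r in rows]
    for r in out:
        if rng.random() < frac:
            r[col] = None
    return out

def duplicate_entities(rows, n, seed=0):
    """Append copies of randomly chosen rows (duplicate entities)."""
    rng = random.Random(seed)
    return rows + [dict(rng.choice(rows)) for _ in range(n)]

def shuffle_columns(rows, seed=0):
    """Reorder the columns of every row (a simple structural variation)."""
    rng = random.Random(seed)
    cols = list(rows[0])
    rng.shuffle(cols)
    return [{c: r[c] for c in cols} for r in rows]

# Toy table: serialize each perturbed variant into the prompt and
# compare the LLM's answers against the clean-table baseline.
table = [{"city": "Paris", "pop": 2.1}, {"city": "Rome", "pop": 2.8}]
perturbed = shuffle_columns(duplicate_entities(drop_values(table, "pop", 0.5), 1))
```

Holding the underlying question fixed while varying only the table's surface form is what isolates robustness from raw reasoning ability.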
Methodology
The paper builds on the TQA-Bench benchmark for tabular reasoning tasks, revising it to improve the validity of the queries and downscaling the tabular inputs to gain more granular insights. It then uses an LLM-as-a-judge approach to evaluate the LLM-generated answers reliably.
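An LLM-as-a-judge setup of the kind described above might look like the following sketch. The prompt wording, function names, and the `call_model` placeholder are our own assumptions for illustration, not the paper's implementation.

```python
def build_judge_prompt(question, reference, candidate):
    """Assemble a grading prompt for a judge model (wording is illustrative)."""
    return (
        "You are grading an answer to a question about a table.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly CORRECT or INCORRECT."
    )

def parse_verdict(reply):
    """Map the judge model's free-form reply onto a binary verdict."""
    return reply.strip().upper().startswith("CORRECT")

# Usage with any chat-completion client (call_model is a placeholder):
# verdict = parse_verdict(call_model(build_judge_prompt(q, ref, ans)))
```

Because the judge compares meaning rather than surface form, it can credit answers that string-overlap metrics like BLEU would wrongly penalize.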
Results and Findings