State of AI

Effective Multimodal Reasoning, Efficient Model Scaling, and Flexible Robotic Integration

Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI ∙ Sep 23, 2025 ∙ Paid

Welcome to today's edition of State of AI 👋 And a warm welcome to our new subscribers since the last edition!


We’re opening up new sponsor spots! 🚀


Get your service, app, or site in front of 17,000 AI-focused readers.

Become a Sponsor!


This edition covers techniques for enhancing the reasoning capabilities of large language models, along with approaches for improving the efficiency and flexibility of AI systems in real-world applications. We'll dive into research at the intersection of perception, language, and action in embodied settings, as well as methods for accelerating inference in diffusion-based language models.

Here's what caught our attention:

  • Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers: Researchers demonstrate how the attention maps in multi-modal diffusion transformer blocks can be leveraged for competitive zero-shot semantic segmentation.

  • MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction: A new framework for multimodal retrieval that enables efficient, scalable test-time interaction, advancing the state of the art in large-scale retrieval tasks (see the short scoring sketch after this list).

  • Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning: The introduction of a novel benchmark designed to push the boundaries of symbolic reasoning in large language models, covering a diverse range of formal domains.

  • V2V-GoT: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models and Graph-of-Thoughts: A graph-of-thoughts reasoning framework that enhances the perception, prediction, and planning capabilities of MLLM-based cooperative autonomous driving systems.

  • VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model: A new paradigm for building efficient and high-performing vision-language-action models, reducing the reliance on large-scale pre-training.
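
MetaEmbed's headline idea rests on multi-vector "late interaction" scoring, where the query and each candidate contribute several embedding vectors and relevance is computed between the two sets at search time. As a rough sketch of that general mechanism (not the paper's actual API; the function names and the budget-truncation detail are our assumptions), here is a ColBERT-style MaxSim scorer with a test-time compute knob:

```python
import torch

def late_interaction_score(query_vecs: torch.Tensor,
                           doc_vecs: torch.Tensor) -> torch.Tensor:
    """MaxSim: each query vector matches its most similar candidate vector,
    and the per-vector maxima are summed into one relevance score."""
    # query_vecs: (num_q, dim); doc_vecs: (num_d, dim); both L2-normalized.
    sim = query_vecs @ doc_vecs.T           # (num_q, num_d) cosine similarities
    return sim.max(dim=1).values.sum()      # best match per query vector, summed

def score_at_budget(query_vecs, doc_vecs, k: int):
    # Hypothetical test-time knob: score with only the first k vectors per
    # side, trading retrieval quality against cheaper interaction.
    return late_interaction_score(query_vecs[:k], doc_vecs[:k])

q = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(8, 128), dim=-1)
print(score_at_budget(q, d, k=4))
```

The appeal of this scheme is that the per-vector budget can be chosen per deployment: cheap coarse scoring for first-stage retrieval, more vectors when reranking a short list.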

Let's get into it 👇

Contents

  1. Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates

  2. Neuromorphic Intelligence

  3. AI Copilots for Reproducibility in Science: A Case Study

  4. Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

  5. MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

  6. GraDeT-HTR: A Resource-Efficient Bengali Handwritten Text Recognition System utilizing Grapheme-based Tokenizer and Decoder-only Transformer

  7. GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance

  8. Optimizing Inference in Transformer-Based Models: A Multi-Method Benchmark

  9. Efficient Neural SDE Training using Wiener-Space Cubature

  10. Spiffy: Multiplying Diffusion LLM Acceleration via Lossless Speculative Decoding

  11. Reasoning Core: A Scalable RL Environment for LLM Symbolic Reasoning

  12. ARK-V1: An LLM-Agent for Knowledge Graph Question Answering Requiring Commonsense Reasoning

  13. How Good are Foundation Models in Step-by-Step Embodied Reasoning?

  14. V2V-GoT: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models and Graph-of-Thoughts

  15. VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates

Authors: Hy Dang, Tianyi Liu, Zhuofeng Wu, Jingfeng Yang, Haoming Jiang, Tao Yang, Pei Chen, Zhengyang Wang, Helen Wang, Huasheng Li, Bing Yin, Meng Jiang

Source and references: https://arxiv.org/abs/2509.18076v1


Introduction

This paper proposes a structured, template-based approach to enhance the function-calling capabilities of large language models (LLMs). The authors aim to guide LLMs through deliberate, step-by-step reasoning for function call generation, as opposed to relying on naive, unguided outputs.

Key Points

  • The authors develop an explicit prompting template that guides LLMs through critical stages of function calling, including tool understanding, parameter extraction, implicit conversion, and other task-specific requirements.

  • They introduce an approach for constructing a Guided-Template structured reasoning dataset (ToolGT) that effectively teaches models to improve accuracy and transparency across diverse tasks and model architectures.

  • Experimental results show that the template-based prompting and training methods consistently outperform both No-Thought and Chain-of-Thought (CoT) approaches across models and benchmarks.

  • On average, Template-prompting improves over CoT-prompting by +2.8 points on BFCLv2 and +1.7 on Nexus, and Template-based fine-tuning improves over CoT-trained models by +1.0 and +1.3 on the same benchmarks.

  • The authors argue that equipping LLMs with curriculum-style reasoning templates offers a path toward more reliable and generalizable tool use, as opposed to relying solely on unconstrained CoT reasoning.

Methodology

The authors present a structured, template-based reasoning framework designed to enhance the function-calling capabilities of LLMs. The approach comprises two main components: (1) prompting strategies, where the model is guided through critical reasoning steps using a structured template, and (2) fine-tuning strategies based on the proposed Guided-Template (ToolGT) dataset construction method.
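
To make the prompting component concrete, here is a minimal sketch of what such a guided template could look like, assuming a JSON-style tool schema. The stage names (tool understanding, parameter extraction, implicit conversion) follow the paper's description; the wording, field names, and helper function are illustrative, not the authors' exact template:

```python
import json

# Illustrative guided template; the stage names follow the paper's
# description, but the exact wording and layout are our assumptions.
GUIDED_TEMPLATE = """You are given tools and a user request. Reason in stages:

1. Tool understanding: restate what each candidate tool does and when it applies.
2. Tool selection: pick the tool(s) this request needs, or none.
3. Parameter extraction: quote the span of the request each argument comes from.
4. Implicit conversion: normalize units, dates, and types to the tool's schema.
5. Final call: emit the function call as JSON.

Tools:
{tools}

Request:
{request}
"""

def build_prompt(tools: list[dict], request: str) -> str:
    # Hypothetical helper: render the tool schemas and request into the template.
    return GUIDED_TEMPLATE.format(tools=json.dumps(tools, indent=2),
                                  request=request)

tools = [{
    "name": "get_weather",
    "parameters": {"city": "string", "date": "YYYY-MM-DD"},
}]
print(build_prompt(tools, "Will it rain in Paris tomorrow?"))
```

The fine-tuning side then uses completions structured along these same stages (the ToolGT dataset) so the model internalizes the staged reasoning rather than depending on the prompt alone.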

Results and Findings

Keep reading with a 7-day free trial

Subscribe to State of AI to keep reading this post and get 7 days of free access to the full post archives.
