Scaling Transformers, Video-Language Models, and Collaborative Reasoning
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to today’s edition of State of AI 🤖 👋
This edition ranges from scaling transformer architectures and advanced video-language models to novel multi-agent reasoning frameworks and benchmarks for cultural and multilingual video understanding. We also delve into the origins of neural scaling laws and reinforcement learning approaches that promote creative problem-solving.
Here’s what caught our attention:
STEM: Scaling Transformers with Embedding Modules - A static, token-indexed approach that decouples parametric capacity from per-token compute, reducing FLOPs and parameter accesses while improving downstream accuracy.
Molmo2: Open Weights and Data for Vision-Language Models - A new family of open video-language models with exceptional grounding capabilities, matching or surpassing prior open models and even some proprietary systems.
Collaborative Multi-Agent Test-Time Reinforcement Learning - A framework that leverages structured textual experience to enhance the capabilities of collaborative multi-agent systems, leading to improved performance across various domains.
Let’s get into it 👇
Contents
From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
On the origin of neural scaling laws: from random graphs to natural language
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization
TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge
Generative AI collective behavior needs an interactionist paradigm
Pareto-Grid-Guided Large Language Models for Fast and High-Quality Heuristics Design in Multi-Objective Combinatorial Optimization
Source and references: https://arxiv.org/abs/2507.20923v3
Introduction
This paper introduces a novel framework called MPaGE for automatically designing heuristics to solve multi-objective combinatorial optimization problems (MOCOP). MPaGE leverages large language models (LLMs) and Pareto Front Grid (PFG) techniques to discover a diverse set of heuristics that jointly optimize solution quality and runtime efficiency.
Key Points
MPaGE is the first framework to systematically combine LLMs with the Simple Evolutionary Multiobjective Optimization (SEMO) paradigm and PFG.
It uses LLMs to verify the logical structure of heuristics and perform cross-cluster recombination, enhancing diversity and reducing redundancy.
Through extensive experiments on standard MOCOP benchmarks, MPaGE demonstrates consistent improvements in runtime efficiency, solution quality, and semantic diversity over LLM-based baselines and traditional multi-objective evolutionary algorithms (MOEAs).
Methodology
MPaGE partitions the objective space into grid cells using PFG and retains top-performing candidates to guide heuristic generation. It then employs LLMs to assess the semantic structures of the candidate heuristics, clustering them into groups of similar logic. Variation is then performed with respect to these clusters, promoting semantic diversity and mitigating redundancy within the heuristic population.
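The grid-based selection step described above can be illustrated with a short sketch. This is not the authors' implementation: the two objectives (solution quality and runtime, both minimized), the grid resolution, and the per-cell ranking rule are illustrative assumptions chosen to show how partitioning the objective space into cells and retaining top performers per cell preserves diversity along the Pareto front.

```python
def pfg_select(candidates, num_bins=4, per_cell=1):
    """Partition a 2-D objective space into a grid and keep the best
    candidates in each occupied cell.

    `candidates` is a list of (name, quality, runtime) tuples, with both
    objectives minimized. The ranking by objective sum is a placeholder
    for whatever per-cell criterion a real PFG implementation uses.
    """
    qs = [c[1] for c in candidates]
    rs = [c[2] for c in candidates]
    q_lo, q_hi = min(qs), max(qs)
    r_lo, r_hi = min(rs), max(rs)

    def bin_index(v, lo, hi):
        if hi == lo:
            return 0
        # Map v into an integer bin in [0, num_bins - 1].
        return min(int((v - lo) / (hi - lo) * num_bins), num_bins - 1)

    # Assign each candidate to its grid cell.
    cells = {}
    for cand in candidates:
        key = (bin_index(cand[1], q_lo, q_hi),
               bin_index(cand[2], r_lo, r_hi))
        cells.setdefault(key, []).append(cand)

    # Within each cell, keep the top `per_cell` candidates; survivors
    # from different cells cover different regions of the front.
    survivors = []
    for members in cells.values():
        members.sort(key=lambda c: c[1] + c[2])
        survivors.extend(members[:per_cell])
    return survivors
```

Because at most `per_cell` candidates survive from any one region of the objective space, a cluster of near-duplicate heuristics cannot crowd out candidates that trade quality for speed elsewhere on the front, which is the diversity-preserving effect the grid is meant to provide.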