Scaling Transformers, Video-Language Models, and Collaborative Reasoning
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to today’s edition of State of AI 🤖 👋
This edition ranges from scaling transformer architectures and advanced video-language models to novel multi-agent reasoning frameworks and benchmarks for cultural and multilingual video understanding. We also delve into the origins of neural scaling laws and reinforcement learning approaches that promote creative problem-solving.
Here’s what caught our attention:
STEM: Scaling Transformers with Embedding Modules - A static, token-indexed approach that decouples parametric capacity from per-token compute, reducing FLOPs and parameter accesses while improving downstream accuracy.
Molmo2: Open Weights and Data for Vision-Language Models - A new family of open video-language models with exceptional grounding capabilities, matching or surpassing prior open models and even some proprietary systems.
Collaborative Multi-Agent Test-Time Reinforcement Learning - A framework that leverages structured textual experience to enhance the capabilities of collaborative multi-agent systems, leading to improved performance across various domains.
Let’s get into it 👇
Contents
From Single to Multi-Agent Reasoning: Advancing GeneGPT for Genomics QA
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning
On the origin of neural scaling laws: from random graphs to natural language
Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs
Collaborative Multi-Agent Test-Time Reinforcement Learning for Reasoning
PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization
TinyMyo: a Tiny Foundation Model for Flexible EMG Signal Processing at the Edge
Generative AI collective behavior needs an interactionist paradigm
Pareto-Grid-Guided Large Language Models for Fast and High-Quality Heuristics Design in Multi-Objective Combinatorial Optimization
Source and references: https://arxiv.org/abs/2507.20923v3
Introduction
This paper introduces a novel framework called MPaGE for automatically designing heuristics to solve multi-objective combinatorial optimization problems (MOCOP). MPaGE leverages large language models (LLMs) and Pareto Front Grid (PFG) techniques to discover a diverse set of heuristics that jointly optimize solution quality and runtime efficiency.
Key Points
MPaGE is the first framework to systematically combine LLMs with the Simple Evolutionary Multiobjective Optimization (SEMO) paradigm and PFG.
It uses LLMs to verify the logical structure of heuristics and perform cross-cluster recombination, enhancing diversity and reducing redundancy.
Through extensive experiments on standard MOCOP benchmarks, MPaGE demonstrates consistent improvements in runtime efficiency, solution quality, and semantic diversity over LLM-based baselines and traditional multi-objective evolutionary algorithms (MOEAs).
Methodology
MPaGE partitions the objective space into grid cells using PFG and retains top-performing candidates to guide heuristic generation. It then employs LLMs to assess the semantic structures of the candidate heuristics, clustering them into groups of similar logic. Variation is then performed with respect to these clusters, promoting semantic diversity and mitigating redundancy within the heuristic population.
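The grid-based selection step described above can be illustrated with a short sketch. This is not the authors' implementation: the two objectives (solution quality and runtime, both minimized), the grid resolution, and the per-cell ranking rule are illustrative assumptions chosen to show how partitioning the objective space into cells and retaining top performers per cell preserves diversity along the Pareto front.

```python
def pfg_select(candidates, num_bins=4, per_cell=1):
    """Partition a 2-D objective space into a grid and keep the best
    candidates in each occupied cell.

    `candidates` is a list of (name, quality, runtime) tuples, with both
    objectives minimized. The ranking by objective sum is a placeholder
    for whatever per-cell criterion a real PFG implementation uses.
    """
    qs = [c[1] for c in candidates]
    rs = [c[2] for c in candidates]
    q_lo, q_hi = min(qs), max(qs)
    r_lo, r_hi = min(rs), max(rs)

    def bin_index(v, lo, hi):
        if hi == lo:
            return 0
        # Map v into an integer bin in [0, num_bins - 1].
        return min(int((v - lo) / (hi - lo) * num_bins), num_bins - 1)

    # Assign each candidate to its grid cell.
    cells = {}
    for cand in candidates:
        key = (bin_index(cand[1], q_lo, q_hi),
               bin_index(cand[2], r_lo, r_hi))
        cells.setdefault(key, []).append(cand)

    # Within each cell, keep the top `per_cell` candidates; survivors
    # from different cells cover different regions of the front.
    survivors = []
    for members in cells.values():
        members.sort(key=lambda c: c[1] + c[2])
        survivors.extend(members[:per_cell])
    return survivors
```

Because at most `per_cell` candidates survive from any one region of the objective space, a cluster of near-duplicate heuristics cannot crowd out candidates that trade quality for speed elsewhere on the front, which is the diversity-preserving effect the grid is meant to provide.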