The Efficiency Revolution: How Researchers Are Making AI Smarter Without the Bloat
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to today’s edition of State of AI 👋 And a warm welcome to our new subscribers since last edition!
This Week’s Sponsor is Glitter.IO
If you’ve ever spent way too long turning a process into written documentation, Glitter AI flips the script. Record yourself walking through something (or just upload an existing video), and it automatically generates a polished step-by-step guide with screenshots and text. Think SOPs, onboarding docs, internal tooling walkthroughs, all done in seconds instead of hours.
Teams use it to document SOPs, handoffs, and training. Give it a go!
This issue spans topics from scalable architectures for massive-scale recommendation systems to benchmarking the offensive security capabilities of large language models. We also explore techniques for improving the efficiency and explainability of AI models.
Here’s what caught our attention:
Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems - Introduces a scalable architecture that achieves 2x efficiency improvement and enables the first predictable scaling laws for joint sequence-nonsequence modeling in recommendation systems.
ESTAR: Early-Stopping Token-Aware Reasoning for Efficient Inference - Presents a method that reduces reasoning length by ~3.7x while preserving 98.9% accuracy, highlighting early stopping as a powerful mechanism for improving efficiency in large reasoning models.
CyberExplorer: Benchmarking LLM Offensive Security Capabilities - Proposes a realistic attacking simulation environment to comprehensively evaluate the offensive security skills of large language models, uncovering their reasoning efficiency, coordination dynamics, and security-relevant signals.
Let’s dive in! 👇
Contents
ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference
ALIVE: Animate Your World with Lifelike Audio-Video Generation
Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
Towards Explainable Federated Learning: Understanding the Impact of Differential Privacy
Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
Universal computation is intrinsic to language model decoding
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
ST4VLA: Spatially Guided Training for Vision-Language-Action Models
VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Kunlun: Establishing Scaling Laws for Massive-Scale Recommendation Systems through Unified Architecture Design
Source and references: https://arxiv.org/abs/2602.10016v1
Introduction
This paper focuses on establishing predictable scaling laws for massive-scale recommendation systems that jointly model sequential user behaviors and non-sequential context features.
Key Points
Identifies poor scaling efficiency as the main barrier to predictable power-law scaling in recommendation systems, stemming from inefficient modules and suboptimal resource allocation.
Introduces Kunlun, a scalable architecture that systematically improves model efficiency and resource allocation through low-level optimizations (Generalized Dot-Product Attention, Hierarchical Seed Pooling, Sliding Window Attention) and high-level innovations (Computation Skip, Event-level Personalization).
Demonstrates predictable scaling behavior with Kunlun, achieving 2x scaling efficiency improvement over state-of-the-art methods and establishing the first scaling laws for joint sequence-nonsequence modeling at massive scale.
Kunlun has been deployed in major Meta Ads models, delivering significant production impact with a 1.2% improvement in topline metrics.
Methodology
Kunlun adopts a multi-layer architecture where each layer processes both sequence and non-sequence features through two main components: (1) a Kunlun Transformer Block for context-aware sequence modeling, and (2) a Kunlun Interaction Block for bidirectional information exchange between sequence and non-sequence features.
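The two-component layer described above can be sketched in miniature. This is an illustrative toy, not Kunlun's implementation: the attention function, shapes, and residual wiring are all assumptions standing in for the paper's production blocks (e.g. Generalized Dot-Product Attention).

```python
import numpy as np

def attention(q, k, v):
    # plain scaled dot-product attention; a stand-in for the paper's
    # optimized variants
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def kunlun_layer(seq, nonseq):
    """One toy layer: a transformer block over the behavior sequence,
    then a bidirectional interaction block exchanging information with
    the non-sequence context features."""
    # (1) context-aware sequence modeling (Kunlun Transformer Block)
    seq = seq + attention(seq, seq, seq)
    # (2) bidirectional exchange (Kunlun Interaction Block): sequence
    #     attends to context, and context attends back to the sequence
    seq = seq + attention(seq, nonseq, nonseq)
    nonseq = nonseq + attention(nonseq, seq, seq)
    return seq, nonseq

seq = np.random.randn(16, 32)   # 16 behavior events, hidden dim 32
ctx = np.random.randn(4, 32)    # 4 non-sequence context features
for _ in range(3):              # stacked layers
    seq, ctx = kunlun_layer(seq, ctx)
print(seq.shape, ctx.shape)     # (16, 32) (4, 32)
```

Stacking layers this way lets sequence and non-sequence features refine each other at every depth, which is the joint modeling the scaling laws are measured on.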
Results and Findings
Kunlun improves Model FLOPs Utilization (MFU) from 17% to 37% on NVIDIA B200 GPUs and achieves 2x scaling efficiency improvement over state-of-the-art approaches, enabling the first predictable scaling laws for joint sequence-nonsequence modeling in recommendation systems.
Implications and Conclusions
The research introduces an important step towards establishing scaling laws for massive-scale recommendation systems, which is crucial for guiding architectural decisions and resource allocation in these increasingly complex and important systems.
ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference
Authors: Junda Wang, Zhichao Yang, Dongxu Zhang, Sanjit Singh Batra, Robert E. Tillman
Source and references: https://arxiv.org/abs/2602.10004v1
Introduction
This paper introduces Early-Stopping for Token-Aware Reasoning (ESTAR), a method that detects and reduces redundant reasoning in large reasoning models (LRMs) to improve efficiency without sacrificing accuracy.
Key Points
ESTAR combines a trajectory-based classifier that identifies when reasoning can be safely stopped, supervised fine-tuning to teach LRMs to propose self-generated stop signals, and token-aware reinforcement learning that truncates rollouts at self-generated stop points with compute-aware rewards.
Experiments on four reasoning datasets show that ESTAR reduces reasoning length by about 3.7x (from 4799 to 1290 tokens) while preserving 98.9% of the original accuracy (74.9% vs. 74.2%).
ESTAR outperforms other efficient reasoning baselines, reducing reasoning length by up to 7x while maintaining high accuracy.
ESTAR’s cross-domain generalization highlights early stopping as a simple yet powerful mechanism for improving reasoning efficiency in LRMs.
Methodology
ESTAR uses a lightweight classifier (ESTAR-LITE) to detect when redundant thinking occurs, allowing reasoning to be safely truncated. It then trains the LRM to propose its own stop tokens, narrowing the search space of potential early-stop positions. Finally, ESTAR integrates the self-generated stop signals with a reinforcement learning loss function that explicitly rewards correct stop-signal emissions and truncates rollouts at the detected early-stop points.
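The inference-time behavior can be sketched as a decoding loop that consults a stop classifier on the trajectory so far. Everything here is a stand-in: the real ESTAR-LITE is a learned classifier, and the `<stop>` marker, `model_step` callable, and threshold are hypothetical.

```python
def stop_probability(tokens):
    # stand-in for ESTAR-LITE: in the paper this is a learned
    # trajectory classifier; here we simply fire on a self-generated
    # stop marker
    return 1.0 if tokens and tokens[-1] == "<stop>" else 0.0

def generate_with_early_stop(model_step, max_tokens=4096, threshold=0.5):
    """Greedy decoding that truncates the chain of thought as soon as
    the stop classifier judges further reasoning redundant."""
    tokens = []
    for _ in range(max_tokens):
        tokens.append(model_step(tokens))
        if stop_probability(tokens) >= threshold:
            break
    return tokens

# toy "model": reasons for five steps, then proposes a stop token
script = ["step1", "step2", "step3", "step4", "step5", "<stop>", "extra"]
out = generate_with_early_stop(lambda toks: script[len(toks)])
print(out)  # truncates at "<stop>"; "extra" is never emitted
```

The token savings come from everything after the stop point that the base model would otherwise have generated.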
Results and Findings
Across four reasoning benchmarks, ESTAR reduced the average chain-of-thought length from 4799 to 1290 tokens (a 3.7x reduction) while retaining 98.9% of the original accuracy (74.9% vs. 74.2%). This outperformed other efficient reasoning baselines, such as Length Penalty (1.4x shorter, 97.0% relative accuracy) and AdaptThink (2.2x shorter, 97.4% relative accuracy).
Implications and Conclusions
These results highlight early stopping as a simple yet powerful mechanism for improving reasoning efficiency in large reasoning models, without compromising performance. ESTAR’s strong cross-domain generalization demonstrates the versatility of this approach in enhancing the practical deployment of advanced reasoning systems.
CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment
Authors: Nanda Rani, Kimberly Milner, Minghao Shao, Meet Udeshi, Haoran Xi, Venkata Sai Charan Putrevu, Saksham Aggarwal, Sandeep K. Shukla, Prashanth Krishnamurthy, Farshad Khorrami, Muhammad Shafique, Ramesh Karri
Source and references: https://arxiv.org/abs/2602.08023v2
Introduction
This paper introduces CyberExplorer, a benchmark suite for evaluating the offensive security capabilities of large language models (LLMs) in a real-world attacking simulation environment. CyberExplorer aims to address the limitations of existing closed-world settings by creating a partially observable and noisy environment with multiple concurrent vulnerable services.
Key Points
An Open Environment Offensive Security Task that shifts evaluation from isolated challenges to multi-target environments, requiring agents to reason about vulnerable services, multiple exploitable targets, and attack prioritization.
An asynchronous multi-agent architecture featuring parallel entrypoint exploration, supervisor guidance, and critic intervention to support dynamic exploration without predefined plans.
Comprehensive evaluations of state-of-the-art LLMs across correctness, efficiency, coordination, failure modes, and vulnerability discovery signals.
Methodology
CyberExplorer is built upon a virtual machine hosting 40 vulnerable web services derived from real-world CTF challenges. Agents are not provided prior knowledge of service identities, vulnerability locations, or challenge boundaries, and must infer exploitable targets through probing, interaction feedback, and hypothesis refinement. The benchmark is evaluated using a diverse set of LLMs, including both closed-source and open-source models.
Results and Findings
The results demonstrate that effective autonomous exploitation depends not merely on the volume of interactions or discovered flags, but on the agent’s ability to make confident and correct decisions. Models like Claude Opus 4.5 and Gemini 3 Pro benefit from structured reasoning and exhibit rapid hypothesis alignment, while others like Qwen 3 and DeepSeek V3 show prolonged exploration without effective refinement. The analysis also reveals that agent failures are predominantly driven by persistence in incorrect hypotheses, leading to budget exhaustion and significantly higher interaction and cost overhead compared to successful trajectories. Additionally, unsuccessful exploitation attempts can still yield meaningful security intelligence, as agents often surface vulnerability signals and security-relevant findings during exploration.
Implications and Conclusions
CyberExplorer provides a behavior-centric evaluation framework for simulating open-ended attack environments, exposing reasoning efficiency, coordination dynamics, failure persistence, and security-relevant signals that are invisible in existing closed-world benchmarks. By transitioning from isolated challenges to multi-target environments, CyberExplorer enables a more comprehensive and realistic assessment of LLMs’ offensive security capabilities, informing the development of more robust and security-aware agents.
ALIVE: Animate Your World with Lifelike Audio-Video Generation
Authors: Ying Guo, Qijun Gan, Yifu Zhang, Jinlai Liu, Yifei Hu, Pan Xie, Dongjun Qian, Yu Zhang, Ruiqi Li, Yuqi Zhang, Ruibiao Lu, Xiaofeng Mei, Bo Han, Xiang Yin, Bingyue Peng, Zehuan Yuan
Source and references: https://arxiv.org/abs/2602.08682v2
Introduction
The paper presents ALIVE, a generation model that adapts a pre-trained Text-to-Video (T2V) model to enable Sora-style audio-video generation and animation.
Key Points
ALIVE introduces a joint audio-video modeling architecture that seamlessly integrates Audio and Video DiTs via an extended “Dual Stream + Single Stream” paradigm.
The paper presents comprehensive audio-video data pipelines that perform dual-quality filtering on both audio and video modalities, and employ a joint ‘visual + audio’ keyword labeling system.
The authors introduce Alive-Bench 1.0, a comprehensive benchmark for joint audio-visual generation that evaluates model performance across multiple dimensions.
The paper introduces a cross-pair pipeline and a unified-editing-based reference augmentation pipeline to enable character-driven video generation for role-playing scenarios.
Methodology
ALIVE’s core architecture is a Joint Audio-Video DiT, based on rectified flow Transformers. It seamlessly integrates a well-designed Audio DiT with the Video DiT. The authors introduce UniTemp-RoPE and TA-CrossAttn to address temporal alignment between audio and video latents.
Results and Findings
ALIVE demonstrates superior performance on the Alive-Bench 1.0 benchmark, achieving state-of-the-art results across various domains including visual aesthetics, audio prompt following, and audio-video synchronization. The model’s reference animation capabilities enable lifelike audio-visual content creation.
Implications and Conclusions
By natively supporting reference animation within the audio-video synthesis framework, ALIVE empowers everyone to animate their world with lifelike audio-visual content and fosters further advancements in multi-modal generative research.
Causality in Video Diffusers is Separable from Denoising
Authors: Xingjian Bai, Guande He, Zhengqi Li, Eli Shechtman, Xun Huang, Zongze Wu
Source and references: https://arxiv.org/abs/2602.10095v1
Introduction
This paper investigates the role of causality in video diffusion models and proposes a novel architecture called Separable Causal Diffusion (SCD) that decouples temporal reasoning from iterative denoising.
Key Points
Through probing and finetuning experiments, the authors identify two key regularities in causal video diffusion models: 1) middle-layer features exhibit strong consistency across denoising steps, and 2) cross-frame attention becomes sparse in deeper layers.
The authors introduce SCD, an architecture that separates temporal reasoning via a causal transformer encoder from frame-wise rendering via a lightweight diffusion decoder.
SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines, on both pretraining and post-training tasks across synthetic and real benchmarks.
Methodology
The authors adopt an autoregressive video diffusion model as a testbed and conduct systematic probing experiments to uncover the observed regularities. Motivated by these findings, they design the SCD architecture, which explicitly decouples temporal reasoning from iterative denoising.
Results and Findings
The probing experiments reveal that middle-layer features in causal video diffusion models exhibit high cosine similarity across denoising steps, indicating redundant computation. Additionally, the authors find that cross-frame attention becomes sparse in deeper layers, suggesting that temporal reasoning is primarily performed in the earlier layers. Based on these observations, the authors introduce SCD, which achieves substantial computational speedups while preserving generation quality compared to strong causal diffusion baselines.
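The redundancy probe underlying the first finding is easy to illustrate: compare one frame's features at adjacent denoising steps with cosine similarity. This toy simulates the effect with a shared component plus step-specific noise; the dimensions and noise scale are assumptions, not the paper's setup.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy probe: "middle-layer features" of one frame at 8 denoising steps,
# simulated as a shared component plus small step-specific noise
rng = np.random.default_rng(0)
base = rng.standard_normal(512)
feats = [base + 0.1 * rng.standard_normal(512) for _ in range(8)]

sims = [cosine(feats[i], feats[i + 1]) for i in range(7)]
print(min(sims))  # near 1.0: step-to-step redundancy
```

When consecutive steps recompute nearly identical features, that computation can be factored out, which is exactly what SCD does by running temporal reasoning once and leaving only light frame-wise denoising.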
Implications and Conclusions
The paper demonstrates that the causal reasoning in video diffusion models can be separated from the multi-step denoising process, leading to more efficient architectures. The proposed SCD approach highlights the potential for decoupling temporal and spatial reasoning in generative models, which could have broader implications for video and other sequential data modeling tasks.
Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders
Authors: Amandeep Kumar, Vishal M. Patel
Source and references: https://arxiv.org/abs/2602.10099v1
Introduction
This paper introduces Riemannian Flow Matching with Jacobi Regularization (RJF), a method for effectively training standard Diffusion Transformer architectures on high-dimensional representation encoders like DINOv2.
Key Points
The authors identify Geometric Interference as the fundamental bottleneck preventing standard diffusion transformers from learning on high-dimensional representations.
They propose Riemannian Flow Matching to define the generative process directly on the hyperspherical manifold, ensuring the trajectories follow geodesics.
They introduce Jacobi Regularization to account for curvature-induced error propagation on the manifold.
Their method enables standard DiT architectures to converge efficiently without the need for computationally expensive width scaling.
Methodology
The authors reformulate the diffusion process to operate directly on the intrinsic data manifold by projecting features to the unit hypersphere and defining the conditional probability paths using Spherical Linear Interpolation (SLERP) rather than linear interpolation. They further introduce Jacobi Regularization to account for curvature-induced error propagation on the manifold.
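The SLERP interpolation at the heart of the method is standard and can be written directly. The feature dimension and endpoints below are illustrative; only the formula is taken as given.

```python
import numpy as np

def slerp(x0, x1, t):
    """Spherical linear interpolation between unit vectors x0 and x1,
    replacing the straight-line paths of Euclidean flow matching so
    that trajectories follow geodesics on the hypersphere."""
    omega = np.arccos(np.clip(x0 @ x1, -1.0, 1.0))
    if omega < 1e-8:  # nearly parallel endpoints: fall back to lerp
        return (1 - t) * x0 + t * x1
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

rng = np.random.default_rng(0)
a = rng.standard_normal(768); a /= np.linalg.norm(a)  # e.g. a noise sample
b = rng.standard_normal(768); b /= np.linalg.norm(b)  # e.g. a projected feature
mid = slerp(a, b, 0.5)
print(np.linalg.norm(mid))  # stays on the unit sphere, unlike lerp
```

Linear interpolation between unit vectors cuts through the interior of the sphere, so intermediate points leave the manifold; SLERP keeps every point of the probability path on it.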
Results and Findings
The authors demonstrate that their RJF method outperforms standard Euclidean Flow Matching and prior methods that require complex auxiliary losses or architectural changes. On the 131M-parameter DiT-B model, RJF achieves an FID of 3.37 with guidance and 4.95 without guidance, significantly surpassing the Euclidean baseline. These gains persist at larger scales, with the DiT-XL model reaching an FID of 3.62 in 80 epochs without guidance.
Implications and Conclusions
The authors’ findings show that respecting the intrinsic geometry of high-dimensional representation spaces is critical for efficient generative modeling, enabling standard Diffusion Transformer architectures to converge effectively without the need for computationally expensive width scaling.
Biases in the Blind Spot: Detecting What LLMs Fail to Mention
Authors: Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu
Source and references: https://arxiv.org/abs/2602.10117v1
Introduction
This paper investigates the presence of “unverbalized biases” in large language models (LLMs), which are biases that influence the model’s decision-making process without being explicitly cited in the model’s chain-of-thought (CoT) reasoning.
Key Points
The authors introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases in LLMs.
The pipeline uses LLM autoraters to generate candidate bias concepts, which are then tested on progressively larger input samples.
Statistical techniques are applied to identify concepts that yield statistically significant performance differences without being cited in the model’s CoT.
The pipeline is evaluated on six LLMs across three decision tasks (hiring, loan approval, and university admissions).
The technique automatically discovers previously unknown biases (e.g., Spanish fluency, English proficiency, writing formality) and validates biases identified by prior work (e.g., gender, race, religion, ethnicity).
Methodology
The authors’ pipeline uses LLM autoraters to generate candidate bias concepts, which are then tested on progressively larger input samples. Statistical techniques, including multiple-testing correction and early stopping, are applied to identify concepts that yield statistically significant performance differences without being cited in the model’s CoT.
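The statistical core of such a pipeline can be sketched as a permutation test on decisions with and without a candidate concept. The test choice and toy data are assumptions; the paper's exact procedure (with its multiple-testing machinery) is more elaborate.

```python
import random

def permutation_test(scores_a, scores_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference of means: did
    flipping a candidate concept (e.g. adding Spanish fluency to a
    resume) shift the model's approval rate more than chance would?"""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = scores_a + scores_b
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:len(scores_a)], pooled[len(scores_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_perm  # fraction of shuffles at least as extreme

# toy data: 1 = approve; the concept flips a large share of decisions
with_concept = [1] * 40 + [0] * 10
without_concept = [1] * 25 + [0] * 25
p = permutation_test(with_concept, without_concept)
print(p)  # small p-value: flag the concept as a candidate bias
```

A concept that survives this test, yet never appears in the model's chain of thought, is precisely an "unverbalized bias" in the paper's sense.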
Results and Findings
The authors’ pipeline successfully identified previously unknown biases in the six LLMs across the three decision tasks, such as biases related to Spanish fluency, English proficiency, and writing formality. Additionally, the pipeline validated biases that had been manually identified by prior work, including biases related to gender, race, religion, and ethnicity.
Implications and Conclusions
The authors’ proposed approach provides a practical and scalable way to automatically discover task-specific biases in LLMs, which is important for improving the reliability and transparency of these models’ decision-making processes.
Towards Explainable Federated Learning: Understanding the Impact of Differential Privacy
Authors: Júlio Oliveira, Rodrigo Ferreira, André Riker, Glaucio H. S. Carvalho, Eirini Eleni Tsilopoulou
Source and references: https://arxiv.org/abs/2602.10100v1
Introduction
This paper aims to achieve a machine learning (ML) model that combines enhanced data privacy with explainability. The authors propose a Federated Learning (FL) solution called Federated EXplainable Trees with Differential Privacy (FEXT-DP) that uses Decision Trees as the underlying model and applies Differential Privacy (DP) to provide an additional layer of data privacy protection.
Key Points
The proposed FEXT-DP model is based on Decision Trees, which are lightweight and have superior explainability compared to neural networks-based FL systems.
FEXT-DP applies Differential Privacy (DP) to the Tree-Based model to provide an additional layer of data privacy protection.
The paper also presents the impact of DP protection on the explainability of the ML model, as adding DP can harm the explainability of the system.
Methodology
The authors propose the FEXT-DP model, which combines Federated Learning, Decision Trees, and Differential Privacy. The Decision Tree-based model is chosen for its explainability, while Differential Privacy is applied to provide an additional layer of data privacy protection.
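The kind of perturbation involved can be illustrated with the Laplace mechanism on a regression-tree leaf. This is a generic DP sketch under assumed bounds, not the paper's exact mechanism; the sensitivity argument (range / n for a bounded mean) is standard.

```python
import math
import random

def laplace_noise(scale, rng):
    # sample Laplace(0, scale) via the inverse CDF
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def dp_leaf_mean(values, epsilon, value_range, seed=0):
    """Release a regression-tree leaf mean under epsilon-DP using the
    Laplace mechanism; the sensitivity of a mean over n points bounded
    by value_range is value_range / n."""
    rng = random.Random(seed)
    scale = (value_range / len(values)) / epsilon
    return sum(values) / len(values) + laplace_noise(scale, rng)

leaf = [0.2, 0.4, 0.3, 0.5, 0.1, 0.6, 0.2, 0.4]  # client targets in [0, 1]
loose = dp_leaf_mean(leaf, epsilon=10.0, value_range=1.0)
tight = dp_leaf_mean(leaf, epsilon=0.1, value_range=1.0)
print(loose, tight)  # stronger privacy (smaller epsilon) => noisier value
```

The trade-off the paper studies falls straight out of the scale term: tightening epsilon inflates the noise, which in turn distorts the tree statistics that make the model explainable.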
Results and Findings
The performance assessment carried out by the authors shows that the FEXT-DP model achieves improvements in terms of faster training (fewer rounds), lower Mean Squared Error, and better explainability compared to other approaches. However, the authors also find that adding DP can harm the explainability of the model.
Implications and Conclusions
The research presented in this paper offers a promising approach for achieving both data privacy and explainability in machine learning models, which are important aspects for modern ML systems. The findings on the impact of Differential Privacy on explainability provide valuable insights for researchers and practitioners working on the intersection of these two important domains.
Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability
Authors: Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, Ekdeep Singh Lubana
Source and references: https://arxiv.org/abs/2602.10067v1
Introduction
This paper presents a novel approach called RLFR (Reinforcement Learning from Feature Rewards) that leverages interpretable model features as scalable supervision signals for open-ended tasks, using the task of reducing hallucinations in language models as a case study.
Key Points
Introduces the use of model features as a source of supervision for learning open-ended tasks, in contrast to their typical use for test-time monitoring or steering.
Develops a concrete framework for operationalizing this approach for the task of reducing hallucinations, using a decomposed probing protocol to monitor for hallucinations and reward retractions/corrections.
Demonstrates that when applied to the Gemma-3-12B-IT model, the RLFR approach produces a policy that is 58% less likely to hallucinate than the original model, while being significantly more scalable than using an external evaluator.
Shows that the use of features as rewards also enables scalable test-time compute improvements via techniques like Best-of-N sampling.
Methodology
The paper proposes the RLFR pipeline, which uses standard interpretability techniques like probing to read out a model’s “beliefs” about relevant concepts (e.g., the factual validity of a claim). These feature readouts are then used as dense, scalable reward signals for training the model to reduce hallucinations through reinforcement learning.
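The reward side of this pipeline can be sketched as a linear probe read out through a sigmoid. The probe weights and activations below are synthetic stand-ins; in RLFR the probe would be fit on labeled model activations.

```python
import numpy as np

def probe_reward(hidden, w, b):
    """Read a 'belief' feature out of a hidden state with a linear
    probe and use the probability as a dense reward signal."""
    logit = float(hidden @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))  # probe's P(claim is factual)

rng = np.random.default_rng(0)
w = rng.standard_normal(64) * 0.1  # hypothetical fitted probe direction
b = 0.0
factual_h = w * 5.0       # activation aligned with the probe direction
fabricated_h = -w * 5.0   # anti-aligned activation
print(probe_reward(factual_h, w, b), probe_reward(fabricated_h, w, b))
```

Because the probe is just a dot product, scoring a rollout costs almost nothing compared to calling an external fact-checking evaluator, which is where the reported ~90x cost advantage comes from.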
Results and Findings
When applied to the Gemma-3-12B-IT model, the RLFR approach resulted in a policy that was 58% less likely to hallucinate compared to the original model. Additionally, the use of feature-derived rewards was found to be approximately 90x cheaper to run per rewarded intervention than the ground truth supervision source.
Implications and Conclusions
By grounding supervision in the language of interpretable model features, this work introduces a novel paradigm in the use of interpretability research, where features can serve as oversight signals to intentionally design models with desirable open-ended capabilities. The authors believe this represents a significant step towards addressing the challenge of learning open-ended behaviors in language models.
Universal computation is intrinsic to language model decoding
Authors: Alex Lewandowski, Marlos C. Machado, Dale Schuurmans
Source and references: https://arxiv.org/abs/2601.08061v2
Introduction
This paper proves that chaining the autoregressive outputs from existing language models is sufficient to perform universal computation, meaning a language model can simulate the execution of any algorithm on any input. The authors also demonstrate that even randomly initialized language models are capable of universal computation before training.
Key Points
Language models can perform universal computation by chaining their autoregressive outputs.
Randomly initialized language models are also capable of universal computation, independent of training.
The computational capabilities of language models are intrinsic to their autoregressive decoding process, rather than arising from training.
Training language models primarily shapes how we interact with these systems, enabling a natural language interface for accessing their computational capabilities.
Failures to elicit desired behaviors from language models are due to challenges in prompt engineering, not inherent computational limitations.
Methodology
The authors prove computational universality by establishing that language model decoding is equivalent to the Lag system, a variation of Post’s tag system that is known to be computationally universal. They demonstrate this equivalence through two strategies: 1) finding a system prompt that drives a trained language model to correctly execute each production rule in a universal Lag system, and 2) learning an injective codebook that enables a randomly initialized language model to simulate the universal Lag system.
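A tag-system stepper makes the shape of the argument concrete: consume a fixed prefix of the context, append a production, repeat, exactly the control flow of chained autoregressive decoding. The runner below is a classic Post tag system (de Mol's 2-tag encoding of the Collatz map), shown as a simpler relative of the Lag systems used in the paper's proof.

```python
def run_tag_system(word, rules, deletion=2, max_steps=1000):
    """Step a Post tag system: read the first symbol, append its
    production, delete the first `deletion` symbols; halt when the
    word gets shorter than `deletion`. Lag systems (the variant the
    paper reduces decoding to) read all `deletion` symbols, but the
    consume-prefix / emit-suffix loop is the same shape as
    autoregressive decoding."""
    steps = 0
    while len(word) >= deletion and steps < max_steps:
        head = word[0]
        word = word[deletion:] + rules[head]
        steps += 1
    return word

# de Mol's 2-tag system simulating the Collatz map, with n encoded
# as "a" * n; starting from n = 3 it halts on the single symbol "a"
rules = {"a": "bc", "b": "a", "c": "aaa"}
print(run_tag_system("a" * 3, rules))
```

Since tag-like systems of this kind are computationally universal, showing that decoding can implement their production rules, via a prompt for a trained model or a learned codebook for a random one, is enough to establish universality.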
Results and Findings
The authors show that autoregressive decoding of both trained and randomly initialized language models, across various sequence modeling architectures, is capable of universal computation. They find that once a model reaches a minimal architecture-specific size, an injective codebook can be learned to drive the randomly initialized model to correctly execute all production rules in the universal Lag system.
Implications and Conclusions
These findings establish that computational universality is an intrinsic property of autoregressive decoding, independent of training on natural language data. This implies that the challenges in eliciting desired behaviors from language models are due to the difficulty of prompt engineering, rather than inherent computational limitations. The authors interpret language models as providing a natural language interface between humans and computers, potentially establishing a third age in the evolution of computational systems.
In-Context Learning Without Copying
Authors: Kerem Sahin, Sheridan Feucht, Adam Belfki, Jannik Brinkmann, Aaron Mueller, David Bau, Chris Wendler
Source and references: https://arxiv.org/abs/2511.05743v2
Introduction
This paper investigates whether induction heads, which perform inductive copying by matching patterns from earlier context, are a necessary building block for learning abstractive in-context learning (ICL) capabilities, or whether such capabilities can emerge independently.
Key Points
The authors introduce the HAPAX training regime, which omits the loss contribution of tokens predictable by induction heads, to suppress inductive copying.
Despite a significant reduction in inductive copying, the HAPAX model achieves higher accuracy than the vanilla model on 13 out of 21 abstractive ICL tasks.
Mechanistic analysis shows that HAPAX models develop fewer and weaker induction heads, and that a majority of prefix-matching heads negatively influence copying.
The token-loss difference metric primarily reflects gains from inductive copying and lacks indicative power for the emergence of abstractive ICL capabilities.
The findings suggest that abstractive ICL capabilities can emerge more independently of induction heads than previously hypothesized.
Methodology
The authors introduce the HAPAX training regime, which masks the loss contributions of token positions that contain a matching n-gram within the same context window (where n > 1). This effectively removes the incentive for the model to learn inductive copying.
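The masking rule can be sketched directly: drop the loss at any position whose ending n-gram already occurred earlier in the context, since an induction head could predict that token by copy-matching. This is a simplified reading of the HAPAX rule, not the paper's exact implementation.

```python
def hapax_loss_mask(tokens, n=2):
    """True = position contributes to the loss; False = masked because
    the n-gram ending there already occurred earlier in the context
    (i.e. an induction head could predict it by copying)."""
    mask = [True] * len(tokens)
    for i in range(n - 1, len(tokens)):
        ngram = tuple(tokens[i - n + 1: i + 1])
        earlier = (tuple(tokens[j: j + n]) for j in range(i - n + 1))
        if ngram in earlier:
            mask[i] = False
    return mask

toks = ["the", "cat", "sat", "on", "the", "cat", "mat"]
print(hapax_loss_mask(toks))
# the second "cat" is masked: its bigram ("the", "cat") appeared before
```

With those positions removed from the objective, the model gains nothing from learning inductive copying, which is what lets the paper test whether abstractive ICL emerges anyway.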
Results and Findings
The HAPAX model experiences a 66% drop in random repetition performance, indicating a significant reduction in inductive copying. However, the model preserves its abstractive ICL capabilities, achieving higher accuracy on 13 out of 21 tasks compared to the vanilla model.
Mechanistic analysis reveals that HAPAX models develop fewer and weaker induction heads, and that a majority of heads displaying prefix-matching patterns negatively influence copying. Additionally, the token-loss difference metric, which is strongly influenced by induction heads, does not correlate well with the emergence of abstractive ICL capabilities.
Implications and Conclusions
The findings suggest that abstractive ICL capabilities can emerge more independently of induction heads than previously hypothesized, indicating a weaker developmental link between the two. This provides insights into the emergence of different ICL mechanisms during training and the role of inductive copying in the development of in-context learning capabilities.
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models
Authors: Nam V. Nguyen, Thong T. Doan, Luong Tran, Van Nguyen, Quang Pham
Source and references: https://arxiv.org/abs/2411.00918v4
Introduction
This paper introduces LibMoE, a unified framework for reproducible and efficient research on Mixture-of-Experts (MoE) models in large language models and vision-language models.
Key Points
LibMoE provides a standardized platform for training and evaluating state-of-the-art MoE algorithms under realistic resource constraints.
The framework enables comprehensive analysis of routing dynamics, expert specialization, and load balancing across different MoE architectures and training regimes.
LibMoE supports both full pretraining and sparse upcycling approaches, allowing researchers to explore MoE models in both early-stage and late-stage settings.
The authors conduct a large-scale empirical study using LibMoE, uncovering key insights about the factors shaping MoE performance and stability.
Methodology
The authors implement seven state-of-the-art MoE algorithms within the LibMoE framework, including methods that modify the router network, utilize shared experts, and introduce specialized expert types. They evaluate these algorithms on both language modeling and vision-language tasks, using standardized pipelines for both full pretraining and sparse upcycling.
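The routing mechanics these algorithms vary can be sketched with a generic top-k softmax router. This is an illustrative sketch, not LibMoE's code; the shapes are arbitrary, and the init-std comparison mirrors the study's observation that flatter router logits spread tokens more evenly.

```python
import numpy as np

def topk_route(x, w_router, k=2):
    """Generic top-k softmax routing: each token picks its k
    highest-logit experts and softmax-normalizes those k logits
    into gate weights."""
    logits = x @ w_router                      # (tokens, experts)
    top = np.argsort(-logits, axis=-1)[:, :k]  # chosen expert ids
    picked = np.take_along_axis(logits, top, axis=-1)
    picked = np.exp(picked - picked.max(axis=-1, keepdims=True))
    gates = picked / picked.sum(axis=-1, keepdims=True)
    return top, gates

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 32))             # 256 tokens, dim 32
w = rng.standard_normal((32, 8))               # 8 experts
_, gates_small = topk_route(x, w * 0.01)       # small init std
_, gates_large = topk_route(x, w * 1.0)
# smaller init std => flatter logits => near-uniform gates, so no
# single expert dominates at initialization
print(gates_small[:, 0].mean(), gates_large[:, 0].mean())
```

Shrinking the router's initialization standard deviation compresses the logit gaps, pushing the top-k gates toward uniform, one plausible mechanism behind the improved early load balancing the study reports.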
Results and Findings
The performance gap among current MoE algorithms is relatively small, suggesting the need for more substantial improvements to justify increased architectural complexity.
Routing stability, expert selection optimality, and task-dependent specialization patterns vary significantly across MoE methods.
A simple reduction in the router’s initialization standard deviation can lead to better expert utilization and load balancing, highlighting initialization as an effective control knob.
Pretraining and sparse upcycling exhibit distinct routing dynamics, with shared experts playing a key role in stabilizing routing behavior in the upcycling regime.
Expert representations remain diverse across all algorithms, supporting the practicality of sparse upcycling as a compute-efficient alternative to full pretraining.
Implications and Conclusions
By providing a unified and accessible framework for MoE research, LibMoE aims to accelerate progress in this area and foster a more collaborative, open-source community. The comprehensive analysis offered by LibMoE provides actionable insights that go beyond standard performance metrics, guiding the development of the next generation of scalable, efficient, and interpretable large models.
ST4VLA: Spatially Guided Training for Vision-Language-Action Models
Authors: Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, Jiangmiao Pang
Source and references: https://arxiv.org/abs/2602.10109v1
Introduction
This paper proposes ST4VLA, a dual-system Vision-Language-Action (VLA) framework that leverages Spatially Guided Training to align action learning with spatial priors in large vision-language models (VLMs).
Key Points
ST4VLA includes two stages: (i) spatial grounding pre-training, which equips the VLM with transferable spatial priors, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation.
This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives.
ST4VLA achieves substantial improvements over vanilla VLA, with performance increasing from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX Robot, establishing new state-of-the-art results on SimplerEnv.
ST4VLA demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings.
Methodology
ST4VLA adopts a dual-system architecture, where the VLM Planner captures spatial and semantic priors, while the Action Expert specializes these priors into embodiment-specific motor commands. Spatial prompting is used to activate the VLM’s spatial perception capability during action post-training, and a lightweight querying transformer is introduced to stabilize expert learning and inference.
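The lightweight querying transformer can be pictured as cross-attention from a small set of learned queries over the VLM's token sequence, producing a fixed-size conditioning signal for the Action Expert regardless of input length. The query count, dimensions, and single-head form below are illustrative assumptions, not the paper's exact design:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def query_vlm_features(vlm_tokens, queries, wq, wk, wv):
    """Cross-attention: a fixed set of learned queries reads out a compact
    summary of a variable-length VLM token sequence (single head, for
    illustration only)."""
    q = queries @ wq          # (n_queries, d)
    k = vlm_tokens @ wk       # (n_tokens, d)
    v = vlm_tokens @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v           # (n_queries, d): fixed-size conditioning

rng = np.random.default_rng(1)
d = 32
vlm_tokens = rng.normal(size=(77, d))   # variable-length VLM output
queries = rng.normal(size=(8, d))       # learned action queries (hypothetical count)
wq, wk, wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
cond = query_vlm_features(vlm_tokens, queries, wq, wk, wv)
print(cond.shape)   # (8, 32), independent of the number of VLM tokens
```

Because the output size is fixed by the queries rather than the input, the action expert sees a stable interface, which is one plausible reason such a bottleneck stabilizes expert learning.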
Results and Findings
Empirical analysis shows that directly fine-tuning a VLM with an action expert leads to a collapse of spatial priors, while naive co-training introduces gradient conflicts. In contrast, ST4VLA’s spatially guided training effectively mitigates these issues, preserving perception while enabling robust control.
On public benchmarks, ST4VLA outperforms state-of-the-art VLA models, achieving a 5.9% gain in Google Robot Visual Matching, a 5.3% gain in Visual Aggregation, and a 9.8% gain on the WidowX benchmark. It also demonstrates strong generalization on large-scale simulated pick-and-place tasks and real-world cluttered-scene manipulation, outperforming strong baselines like π0 and GR00T.
Implications and Conclusions
These findings highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning, bridging high-level multimodal reasoning with low-level embodied control.
VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model
Authors: Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, Zhibo Chen
Source and references: https://arxiv.org/abs/2602.10098v1
Introduction
This paper introduces VLA-JEPA, a JEPA-style pretraining framework for Vision-Language-Action (VLA) models that learns latent representations capturing action-relevant state transition semantics.
Key Points
Analysis of how existing “latent action from video” objectives often remain anchored to pixel variation rather than learning meaningful transition dynamics.
Proposed VLA-JEPA framework that uses “leakage-free state prediction” to learn robust latent representations without pixel reconstruction or future information leakage.
VLA-JEPA enables a simple two-stage training pipeline (JEPA pretraining followed by action-head fine-tuning), avoiding the multi-stage complexity of prior latent-action approaches.
Experiments show VLA-JEPA achieves consistent gains in generalization and robustness over existing VLA methods on simulation benchmarks and real-world robotic tasks.
Methodology
VLA-JEPA adopts a JEPA-style training objective where a target encoder produces latent representations from future frames, while the student pathway sees only the current observation. This “leakage-free state prediction” encourages the model to learn abstract dynamics representations robust to camera motion and background changes.
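The leakage-free objective can be sketched with linear encoders: the student pathway predicts the target encoder's latent of a future frame while seeing only the current observation, and the loss is computed entirely in latent space with no pixel reconstruction. The exponential-moving-average target update and the linear maps are common JEPA simplifications assumed here, not details confirmed by the summary:

```python
import numpy as np

rng = np.random.default_rng(0)
d_obs, d_lat = 64, 16

# Student encoder + predictor see only the current observation.
w_student = rng.normal(scale=d_obs ** -0.5, size=(d_obs, d_lat))
w_pred = np.eye(d_lat)        # predictor over latents (simplified to identity)
w_target = w_student.copy()   # target encoder, initialized from the student

def jepa_loss(obs_t, obs_future):
    """Leakage-free state prediction: the student never sees the future frame;
    it only tries to match the target encoder's latent of that frame."""
    z_pred = (obs_t @ w_student) @ w_pred   # prediction from current obs only
    z_tgt = obs_future @ w_target           # treated as stop-gradient in practice
    return float(np.mean((z_pred - z_tgt) ** 2))

def ema_update(w_target, w_student, tau=0.99):
    """Target encoder slowly tracks the student (a common JEPA choice)."""
    return tau * w_target + (1 - tau) * w_student

obs_t = rng.normal(size=(4, d_obs))
obs_future = obs_t + 0.1 * rng.normal(size=(4, d_obs))  # toy state transition
print(jepa_loss(obs_t, obs_future))
```

Matching latents rather than pixels is what lets the representation ignore nuisance variation such as camera motion, since the target encoder can discard it before the loss is computed.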
Results and Findings
Experiments on the LIBERO, LIBERO-Plus, and SimplerEnv benchmarks demonstrate that VLA-JEPA outperforms prior VLA methods that rely on latent action or future prediction objectives. VLA-JEPA also shows strong performance on real-world robotic manipulation tasks, surpassing state-of-the-art VLA models.
Implications and Conclusions
The VLA-JEPA framework provides a principled approach to learning action-centric representations from video data, enabling more robust and generalizable VLA policies with a simpler training workflow compared to prior multi-stage pipelines.
From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Authors: Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou
Source and references: https://arxiv.org/abs/2510.17439v2
Introduction
The paper introduces FALCON (From Spatial to Actions), a novel vision-language-action (VLA) model that integrates rich 3D spatial tokens to enhance the model’s spatial reasoning capabilities for robotic manipulation tasks.
Key Points
FALCON leverages spatial foundation models to provide strong geometric priors from RGB input alone, addressing the spatial reasoning gap in existing 2D-based VLAs.
The Embodied Spatial Model can optionally fuse depth, pose, and other 3D modalities to improve spatial representation, while maintaining robustness with RGB-only input.
FALCON employs a Spatial-Enhanced Action Head that directly incorporates spatial tokens into the action generation, preserving the pre-trained vision-language alignment in the backbone model.
The proposed approach enables FALCON to achieve state-of-the-art performance with improved robustness and generalization across simulation benchmarks and real-world manipulation tasks.
Methodology
FALCON consists of three core components: a 2D vision-language model, an Embodied Spatial Model, and a Spatial-Enhanced Action Head. The Embodied Spatial Model encodes RGB, depth, and camera pose inputs into a set of rich spatial tokens, which are then fused with the semantic action representation from the vision-language model to guide the action prediction.
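One way to picture the Spatial-Enhanced Action Head is pooled spatial tokens concatenated with the pooled semantic representation before projection to an action. The fusion-by-concatenation, the pooling, and the 7-DoF bounded output below are illustrative assumptions rather than the paper's exact head:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 32

def spatial_enhanced_action_head(semantic, spatial, w_fuse, w_act):
    """Fuse VLM semantic features with spatial tokens, then predict an action.
    Concatenation-then-projection is an illustrative fusion choice; FALCON's
    actual head may differ."""
    fused = np.concatenate([semantic, spatial], axis=-1) @ w_fuse  # (d,)
    return np.tanh(fused @ w_act)   # bounded 7-DoF action (illustrative)

semantic = rng.normal(size=(d,))   # pooled VLM representation
spatial = rng.normal(size=(d,))    # pooled spatial tokens (RGB-only prior)
w_fuse = rng.normal(scale=(2 * d) ** -0.5, size=(2 * d, d))
w_act = rng.normal(scale=d ** -0.5, size=(d, 7))
action = spatial_enhanced_action_head(semantic, spatial, w_fuse, w_act)
print(action.shape)   # (7,)
```

Feeding spatial tokens only into the head, rather than fine-tuning them into the backbone, is what lets the design preserve the pretrained vision-language alignment described above.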
Results and Findings
FALCON consistently outperforms existing VLA methods on both simulation benchmarks (e.g., CALVIN, SimplerEnv) and real-world manipulation tasks. It achieves state-of-the-art performance in long-horizon, language-guided robot control, and demonstrates strong robustness to spatial variations, such as unseen object sizes, heights, and abstract spatial instructions.
Implications and Conclusions
The proposed FALCON framework effectively integrates spatial understanding into generalist robot policies, addressing a critical limitation in existing VLAs. By leveraging rich 3D priors and flexible fusion of multi-modal cues, FALCON significantly enhances the spatial reasoning capabilities of VLA models, leading to more reliable and adaptable robot manipulation in complex, unstructured environments.