State of AI


Efficient Long Sequence Generation, Pose-Based Fencing Refereeing, and Scaling Laws for Productivity

Latest research summaries in ML, Robotics, CV, NLP and AI

Dec 27, 2025

Welcome to today’s edition of State of AI and our last of the year 👋

In this edition, we explore a range of cutting-edge AI research, from efficient long image sequence generation using novel diffusion models, to pose-based semantic pipelines for automated fencing refereeing, and intriguing scaling laws that quantify the impact of large language models on professional productivity.

Here’s what caught our attention:

  • GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation - A novel approach that treats image sequences as grids of subsampled frames, enabling efficient coarse sequence generation followed by high-resolution refinement.

  • FERA: A Pose-Based Semantic Pipeline for Automated Foil Fencing Refereeing - A framework that converts broadcast fencing video into action tokens and rule-grounded explanations, demonstrating the power of pose-based semantic grounding.

  • Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks - A large-scale study that derives empirical relationships between language model size and professional productivity, with implications for future productivity gains.

  • Learning to Refocus with Video Diffusion Models - A method for realistic post-capture refocusing using video diffusion models, enabling interactive refocusing from a single defocused image.

  • ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision - A framework that directly modulates the internal attention maps of a diffusion transformer to enable precise control over video generation.

  • Assessing the Software Security Comprehension of Large Language Models - A systematic evaluation of LLMs’ understanding of software security concepts, providing insights into their limitations and future research directions.

  • GeoTransolver: Learning Physics on Irregular Domains Using Multi-scale Geometry Aware Physics Attention Transformer - A novel transformer-based architecture that integrates geometry-aware attention mechanisms for improved performance on computational fluid dynamics tasks.

Let’s get into it 👇

Contents

  1. Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

  2. A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care

  3. FERA: A Pose-Based Semantic Pipeline for Automated Foil Fencing Refereeing

  4. GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

  5. Learning to Refocus with Video Diffusion Models

  6. ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

  7. Assessing the Software Security Comprehension of Large Language Models

  8. GeoTransolver: Learning Physics on Irregular Domains Using Multi-scale Geometry Aware Physics Attention Transformer

  9. MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models

  10. Parallel Token Prediction for Language Models

  11. Measuring all the noises of LLM Evals

  12. SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

  13. Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks

  14. ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering

  15. Schrödinger’s Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

Authors: Yifan Huang, Xiaojun Jia, Wenbo Guo, Yuqiang Sun, Yihao Huang, Chong Wang, Yang Liu

Source and references: https://arxiv.org/abs/2512.21236v1


Introduction

This research paper presents SPELL, a comprehensive testing framework for evaluating the security alignment of large language models (LLMs) against malicious code generation. The authors highlight the increasing integration of LLMs into critical software development workflows and the associated security challenges, particularly the ability of adversaries to exploit model vulnerabilities through carefully crafted prompts.

Key Points

  • SPELL is the first automated framework that dynamically discovers and combines prompt components for malicious code generation, overcoming the limitations of fixed template-based approaches.

  • Extensive evaluations on three major LLMs (GPT-4.1, Qwen2.5 Coder, Claude-3.5) demonstrate SPELL’s superior performance, achieving attack success rates of 83.75%, 68.12%, and 19.38% respectively across eight malicious code categories.

  • SPELL’s component effectiveness analysis provides insights into LLM vulnerabilities and informs future safety alignment research.

  • The authors develop and validate defense strategies based on their findings, achieving 90-100% rejection rates across the tested models, providing immediately deployable countermeasures.

  • The authors provide comprehensive artifacts, including datasets, code, and experimental results, to enable replication and further research in LLM security testing.

Methodology

SPELL employs a time-division epsilon-greedy sentence selection strategy to efficiently navigate the vast prompt space and identify effective sentence combination patterns without requiring exhaustive search. The framework first constructs a Prior Knowledge Dataset of diverse sentence components extracted from existing attack templates, then dynamically selects and combines these components to generate novel, previously unseen attack prompts.
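The time-division epsilon-greedy strategy described above can be illustrated with a minimal sketch: explore the component pool heavily early in the time budget, then shift toward exploiting the components with the best observed success rates. This is not the paper's exact algorithm — the class name, the linear decay schedule, and the incremental-mean scoring are all assumptions for illustration.

```python
import random


class EpsilonGreedySelector:
    """Illustrative time-decaying epsilon-greedy selector over sentence
    components (a sketch, not SPELL's actual implementation)."""

    def __init__(self, components, eps_start=0.9, eps_end=0.1, horizon=150):
        self.components = list(components)
        self.scores = {c: 0.0 for c in self.components}  # running success rate
        self.counts = {c: 0 for c in self.components}
        self.eps_start, self.eps_end, self.horizon = eps_start, eps_end, horizon
        self.step = 0

    def epsilon(self):
        # Linear decay across the time budget; the "time-division"
        # schedule here is an assumption for illustration.
        frac = min(self.step / self.horizon, 1.0)
        return self.eps_start + frac * (self.eps_end - self.eps_start)

    def select(self, k=3):
        # With probability epsilon, explore a random combination;
        # otherwise exploit the k highest-scoring components.
        self.step += 1
        if random.random() < self.epsilon():
            return random.sample(self.components, k)
        return sorted(self.components, key=self.scores.get, reverse=True)[:k]

    def update(self, chosen, success):
        # Incremental-mean update of each chosen component's success rate.
        for c in chosen:
            self.counts[c] += 1
            self.scores[c] += (float(success) - self.scores[c]) / self.counts[c]
```

In use, each `select` call would assemble a candidate prompt from the chosen components, and `update` would feed back whether the resulting attack succeeded — so effective combinations surface without exhaustively searching the prompt space.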

Results and Findings

The results demonstrate SPELL’s significant effectiveness across the evaluated models, outperforming existing methods like Redcode, CL-GSO, and RL-breaker. SPELL achieves attack success rates of 83.75% on GPT-4.1, 68.12% on Qwen2.5-Coder, and 19.38% on Claude-3.5, while requiring only 124-160 steps on average to identify successful attack patterns. The authors’ comprehensive evaluation reveals substantial security gaps in current LLM implementations, with successful deployment in real-world applications like Cursor.

Implications and Conclusions
