Efficient Long Sequence Generation, Pose-Based Fencing Refereeing, and Scaling Laws for Productivity
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to today’s edition of State of AI and our last of the year 👋
In this edition, we explore a range of cutting-edge AI research: efficient long image sequence generation using novel diffusion models, a pose-based semantic pipeline for automated fencing refereeing, and scaling laws that quantify the impact of large language models on professional productivity.
Here’s what caught our attention:
GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation - A novel approach that treats image sequences as grids of subsampled frames, enabling efficient coarse sequence generation followed by high-resolution refinement.
FERA: A Pose-Based Semantic Pipeline for Automated Foil Fencing Refereeing - A framework that converts broadcast fencing video into action tokens and rule-grounded explanations, demonstrating the power of pose-based semantic grounding.
Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks - A large-scale study that derives empirical relationships between language model size and professional productivity, with implications for future productivity gains.
Learning to Refocus with Video Diffusion Models - A method for realistic post-capture refocusing using video diffusion models, enabling interactive refocusing from a single defocused image.
ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision - A framework that directly modulates the internal attention maps of a diffusion transformer to enable precise control over video generation.
Assessing the Software Security Comprehension of Large Language Models - A systematic evaluation of LLMs’ understanding of software security concepts, providing insights into their limitations and future research directions.
GeoTransolver: Learning Physics on Irregular Domains Using Multi-scale Geometry Aware Physics Attention Transformer - A novel transformer-based architecture that integrates geometry-aware attention mechanisms for improved performance on computational fluid dynamics tasks.
Let’s get into it 👇
Contents
Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking
A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care
FERA: A Pose-Based Semantic Pipeline for Automated Foil Fencing Refereeing
GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation
ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision
Assessing the Software Security Comprehension of Large Language Models
Schrödinger’s Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation
Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking
Authors: Yifan Huang, Xiaojun Jia, Wenbo Guo, Yuqiang Sun, Yihao Huang, Chong Wang, Yang Liu
Source and references: https://arxiv.org/abs/2512.21236v1
Introduction
This research paper presents SPELL, a comprehensive testing framework for evaluating the security alignment of large language models (LLMs) against malicious code generation. The authors highlight the increasing integration of LLMs into critical software development workflows and the associated security challenges, particularly the ability of adversaries to exploit model vulnerabilities through carefully crafted prompts.
Key Points
SPELL is the first automated framework that dynamically discovers and combines prompt components for malicious code generation, overcoming the limitations of fixed template-based approaches.
Extensive evaluations on three major LLMs (GPT-4.1, Qwen2.5-Coder, Claude-3.5) demonstrate SPELL’s superior performance, achieving attack success rates of 83.75%, 68.12%, and 19.38%, respectively, across eight malicious code categories.
SPELL’s component effectiveness analysis provides insights into LLM vulnerabilities and informs future safety alignment research.
The authors develop and validate defense strategies based on their findings, achieving 90-100% rejection rates across the tested models, providing immediately deployable countermeasures.
The authors provide comprehensive artifacts, including datasets, code, and experimental results, to enable replication and further research in LLM security testing.
Methodology
SPELL employs a time-division epsilon-greedy sentence selection strategy to efficiently navigate the vast prompt space and identify effective sentence combination patterns without requiring exhaustive search. The framework first constructs a Prior Knowledge Dataset of diverse sentence components extracted from existing attack templates, then dynamically selects and combines these components to generate novel, previously unseen attack prompts.
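The paper does not include source code in this summary, but the time-division epsilon-greedy idea can be illustrated with a minimal sketch: exploration probability is annealed over the run, so early steps favor random sentence components and later steps favor the highest-scoring ones. All names, scores, and the linear annealing schedule below are illustrative assumptions, not the authors' implementation.

```python
import random

def select_sentence(components, scores, step, total_steps,
                    eps_start=0.9, eps_end=0.1):
    """Illustrative epsilon-greedy pick with a time-decayed epsilon.

    components: candidate sentence components (hypothetical labels)
    scores: observed effectiveness score per component
    step/total_steps: position in the search budget, used to anneal epsilon
    """
    # Linearly anneal epsilon from eps_start to eps_end over the run
    # ("time-division": different exploration rates in different phases).
    frac = step / max(total_steps - 1, 1)
    epsilon = eps_start + (eps_end - eps_start) * frac
    if random.random() < epsilon:
        return random.choice(components)                  # explore
    return max(components, key=lambda c: scores[c])       # exploit best so far

# Toy usage with made-up component names and scores:
comps = ["role_play", "obfuscation", "hypothetical_framing"]
scores = {"role_play": 0.4, "obfuscation": 0.7, "hypothetical_framing": 0.2}
choice = select_sentence(comps, scores, step=150, total_steps=160)
```

Late in the search (step 150 of 160 here), epsilon is near its floor, so the selector mostly exploits the component with the best observed score while still occasionally exploring.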
Results and Findings
The results demonstrate SPELL’s significant effectiveness across the evaluated models, outperforming existing methods like Redcode, CL-GSO, and RL-breaker. SPELL achieves attack success rates of 83.75% on GPT-4.1, 68.12% on Qwen2.5-Coder, and 19.38% on Claude-3.5, while requiring only 124-160 steps on average to identify successful attack patterns. The authors’ comprehensive evaluation reveals substantial security gaps in current LLMs, including successful attacks against models deployed in real-world applications such as Cursor.