State of AI


Efficient Long Sequence Generation, Pose-Based Fencing Refereeing, and Scaling Laws for Productivity

Latest research summaries in ML, Robotics, CV, NLP and AI

Dec 27, 2025

Welcome to today’s edition of State of AI and our last of the year 👋

In this edition, we explore a range of cutting-edge AI research, from efficient long image sequence generation using novel diffusion models, to pose-based semantic pipelines for automated fencing refereeing, and intriguing scaling laws that quantify the impact of large language models on professional productivity.

Here’s what caught our attention:

  • GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation - A novel approach that treats image sequences as grids of subsampled frames, enabling efficient coarse sequence generation followed by high-resolution refinement.

  • FERA: A Pose-Based Semantic Pipeline for Automated Foil Fencing Refereeing - A framework that converts broadcast fencing video into action tokens and rule-grounded explanations, demonstrating the power of pose-based semantic grounding.

  • Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks - A large-scale study that derives empirical relationships between language model size and professional productivity, with implications for future productivity gains.

  • Learning to Refocus with Video Diffusion Models - A method for realistic post-capture refocusing using video diffusion models, enabling interactive refocusing from a single defocused image.

  • ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision - A framework that directly modulates the internal attention maps of a diffusion transformer to enable precise control over video generation.

  • Assessing the Software Security Comprehension of Large Language Models - A systematic evaluation of LLMs’ understanding of software security concepts, providing insights into their limitations and future research directions.

  • GeoTransolver: Learning Physics on Irregular Domains Using Multi-scale Geometry Aware Physics Attention Transformer - A novel transformer-based architecture that integrates geometry-aware attention mechanisms for improved performance on computational fluid dynamics tasks.

Let’s get into it 👇

Contents

  1. Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

  2. A Real-World Evaluation of LLM Medication Safety Reviews in NHS Primary Care

  3. FERA: A Pose-Based Semantic Pipeline for Automated Foil Fencing Refereeing

  4. GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

  5. Learning to Refocus with Video Diffusion Models

  6. ACD: Direct Conditional Control for Video Diffusion Models via Attention Supervision

  7. Assessing the Software Security Comprehension of Large Language Models

  8. GeoTransolver: Learning Physics on Irregular Domains Using Multi-scale Geometry Aware Physics Attention Transformer

  9. MiST: Understanding the Role of Mid-Stage Scientific Training in Developing Chemical Reasoning Models

  10. Parallel Token Prediction for Language Models

  11. Measuring all the noises of LLM Evals

  12. SMART SLM: Structured Memory and Reasoning Transformer, A Small Language Model for Accurate Document Assistance

  13. Scaling Laws for Economic Productivity: Experimental Evidence in LLM-Assisted Consulting, Data Analyst, and Management Tasks

  14. ChainReaction: Causal Chain-Guided Reasoning for Modular and Explainable Causal-Why Video Question Answering

  15. Schrödinger’s Navigator: Imagining an Ensemble of Futures for Zero-Shot Object Navigation

Casting a SPELL: Sentence Pairing Exploration for LLM Limitation-breaking

Authors: Yifan Huang, Xiaojun Jia, Wenbo Guo, Yuqiang Sun, Yihao Huang, Chong Wang, Yang Liu

Source and references: https://arxiv.org/abs/2512.21236v1


Introduction

This research paper presents SPELL, a comprehensive testing framework for evaluating the security alignment of large language models (LLMs) against malicious code generation. The authors highlight the increasing integration of LLMs into critical software development workflows and the associated security challenges, particularly the ability of adversaries to exploit model vulnerabilities through carefully crafted prompts.

Key Points

  • SPELL is the first automated framework that dynamically discovers and combines prompt components for malicious code generation, overcoming the limitations of fixed template-based approaches.

  • Extensive evaluations on three major LLMs (GPT-4.1, Qwen2.5 Coder, Claude-3.5) demonstrate SPELL’s superior performance, achieving attack success rates of 83.75%, 68.12%, and 19.38% respectively across eight malicious code categories.

  • SPELL’s component effectiveness analysis provides insights into LLM vulnerabilities and informs future safety alignment research.

  • The authors develop and validate defense strategies based on their findings, achieving 90-100% rejection rates across the tested models, providing immediately deployable countermeasures.

  • The authors provide comprehensive artifacts, including datasets, code, and experimental results, to enable replication and further research in LLM security testing.

Methodology

SPELL employs a time-division epsilon-greedy sentence selection strategy to efficiently navigate the vast prompt space and identify effective sentence combination patterns without requiring exhaustive search. The framework first constructs a Prior Knowledge Dataset of diverse sentence components extracted from existing attack templates, then dynamically selects and combines these components to generate novel, previously unseen attack prompts.
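The time-division epsilon-greedy strategy described above can be illustrated with a minimal sketch: explore the component pool heavily early in the time budget, then shift toward exploiting the components with the best observed success rates. This is not the paper's exact algorithm — the class name, the linear decay schedule, and the incremental-mean scoring are all assumptions for illustration.

```python
import random


class EpsilonGreedySelector:
    """Illustrative time-decaying epsilon-greedy selector over sentence
    components (a sketch, not SPELL's actual implementation)."""

    def __init__(self, components, eps_start=0.9, eps_end=0.1, horizon=150):
        self.components = list(components)
        self.scores = {c: 0.0 for c in self.components}  # running success rate
        self.counts = {c: 0 for c in self.components}
        self.eps_start, self.eps_end, self.horizon = eps_start, eps_end, horizon
        self.step = 0

    def epsilon(self):
        # Linear decay across the time budget; the "time-division"
        # schedule here is an assumption for illustration.
        frac = min(self.step / self.horizon, 1.0)
        return self.eps_start + frac * (self.eps_end - self.eps_start)

    def select(self, k=3):
        # With probability epsilon, explore a random combination;
        # otherwise exploit the k highest-scoring components.
        self.step += 1
        if random.random() < self.epsilon():
            return random.sample(self.components, k)
        return sorted(self.components, key=self.scores.get, reverse=True)[:k]

    def update(self, chosen, success):
        # Incremental-mean update of each chosen component's success rate.
        for c in chosen:
            self.counts[c] += 1
            self.scores[c] += (float(success) - self.scores[c]) / self.counts[c]
```

In use, each `select` call would assemble a candidate prompt from the chosen components, and `update` would feed back whether the resulting attack succeeded — so effective combinations surface without exhaustively searching the prompt space.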

Results and Findings

The results demonstrate SPELL’s significant effectiveness across the evaluated models, outperforming existing methods like Redcode, CL-GSO, and RL-breaker. SPELL achieves attack success rates of 83.75% on GPT-4.1, 68.12% on Qwen2.5-Coder, and 19.38% on Claude-3.5, while requiring only 124-160 steps on average to identify successful attack patterns. The authors’ comprehensive evaluation reveals substantial security gaps in current LLM implementations, with successful deployment in real-world applications like Cursor.

Implications and Conclusions
