Benchmarking MLLMs for Front-End Code, Forging Diffusion Watermarks, and Modeling Constraints with LLMs
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to today’s edition of State of AI
👋 And a warm welcome to our 84 new subscribers since the last edition!
Today’s research dives into front-end code generation, diffusion watermark forgeries, and constraint modeling with LLMs. If you’re interested in MLLMs powering UI workflows, multimodal dataset safety, or how LLMs reason about cause and effect, then this one’s for you.
Here’s what caught our attention:
DesignBench benchmarks MLLMs for automated front-end development across React, Vue, Angular, and HTML, finally evaluating how well these models perform in real-world generation, editing, and repair workflows.
CP-Bench offers the first rigorous benchmark for LLMs in constraint programming, showing how system prompts and retrieval-augmented inference can improve translation from natural language to MiniZinc, OR-Tools, and CPMpy (see the sketch after this list).
Plug-and-Plant reveals a watermark forgery method that requires no optimization and exploits regenerative diffusion models to plant convincing watermarks, challenging the reliability of current watermarking schemes.
Leopard introduces a vision-language model purpose-built for handling multiple text-rich images like web pages and multi-page documents, outperforming the state-of-the-art on 12 benchmarks.
MimeQA pushes video LLMs in a new direction by testing them on mime performances, highlighting just how poorly even advanced models understand nonverbal communication.
LlavaGuard builds a full framework for visual dataset moderation using open multimodal models, finally giving us a customizable safety layer for images and VLM outputs.
Do LLMs Reason Causally Like Us? Yes, and sometimes better. A study shows GPT-4o and Claude outperform humans in avoiding associative bias in causal reasoning, though they still fall short on subtler inference patterns.
There’s also research on robot navigation with Astra, semantic compression with BEAST, efficient model scaling with Kinetics, and safeguarding autonomous driving with SafeAuto.
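To make the CP-Bench item concrete, here is a minimal, illustrative sketch of the kind of CPMpy model such a translation targets. The scheduling brief, data, and variable names are our own assumptions for illustration, not taken from the paper.

```python
# Toy natural-language brief: "Schedule three tasks of lengths 3, 2 and 4
# on a single machine so that no two overlap and the makespan is minimal."
# One possible CPMpy translation (illustrative data and names):
import cpmpy as cp

durations = [3, 2, 4]
start = cp.intvar(0, 20, shape=3, name="start")
makespan = cp.intvar(0, 20, name="makespan")

model = cp.Model()

# No two tasks may overlap on the single machine.
for i in range(3):
    for j in range(i + 1, 3):
        model += ((start[i] + durations[i] <= start[j]) |
                  (start[j] + durations[j] <= start[i]))

# The makespan bounds every task's finish time.
for i in range(3):
    model += makespan >= start[i] + durations[i]

model.minimize(makespan)
if model.solve():
    print("starts:", start.value(), "makespan:", makespan.value())
```

The benchmark’s question is whether an LLM, given only the prose brief (plus a well-chosen system prompt or retrieved examples), can emit a runnable model of this shape.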
Let’s get into it 👇
Contents
DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
CP-Bench: Evaluating Large Language Models for Constraint Modelling
Optimization-Free Universal Watermark Forgery with Regenerative Diffusion Models
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models
LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models
Do Large Language Models Reason Causally Like Us? Even Better?
Banyan: Improved Representation Learning with Explicit Structure
Explaining Matters: Leveraging Definitions and Semantic Expansion for Sexism Detection
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models
BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning
DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
Authors: Jingyu Xiao, Ming Wang, Man Ho Lam, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu
Source and references: https://arxiv.org/abs/2506.06251v1
Introduction
This paper introduces DesignBench, a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) in automated front-end code generation. The benchmark addresses key limitations of existing evaluations, such as the lack of mainstream development framework integration, insufficient task coverage, and limited evaluation dimensions.