Benchmarking MLLMs for Front-End Code, Forging Diffusion Watermarks, and Modeling Constraints with LLMs
Latest research summaries in ML, Robotics, CV, NLP and AI
Welcome to today’s edition of State of AI
👋 And a warm welcome to our 84 new subscribers since the last edition!
Today’s research dives into front-end code generation, diffusion watermark forgeries, and constraint modeling with LLMs. If you’re interested in MLLMs powering UI workflows, multimodal dataset safety, or how LLMs reason about cause and effect, then this one’s for you.
Here’s what caught our attention:
DesignBench benchmarks MLLMs for automated front-end development across React, Vue, Angular, and HTML, finally evaluating how well these models perform in real-world generation, editing, and repair workflows.
CP-Bench offers the first rigorous benchmark for LLMs in constraint programming, showing how system prompts and retrieval-augmented inference can improve translation from natural language to MiniZinc, OR-Tools, and CPMpy (see the sketch after this list).
Plug-and-Plant reveals a watermark forgery method that requires no optimization and exploits regenerative diffusion models to plant convincing watermarks, challenging the reliability of current watermarking schemes.
Leopard introduces a vision-language model purpose-built for handling multiple text-rich images like web pages and multi-page documents, outperforming the state-of-the-art on 12 benchmarks.
MimeQA pushes video LLMs in a new direction by testing them on mime performances, highlighting just how poorly even advanced models understand nonverbal communication.
LlavaGuard builds a full framework for visual dataset moderation using open multimodal models, finally giving us a customizable safety layer for images and VLM outputs.
Do LLMs Reason Causally Like Us? Yes, and sometimes better. A study shows GPT-4o and Claude outperform humans in avoiding associative bias in causal reasoning, though they still fall short on subtler inference patterns.
There’s also research on robot navigation with Astra, semantic compression with BEAST, efficient model scaling with Kinetics, and safeguarding autonomous driving with SafeAuto.
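To make the CP-Bench item concrete, here is a minimal, illustrative sketch of the kind of CPMpy model such a translation targets. The scheduling brief, data, and variable names are our own assumptions for illustration, not taken from the paper.

```python
# Toy natural-language brief: "Schedule three tasks of lengths 3, 2 and 4
# on a single machine so that no two overlap and the makespan is minimal."
# One possible CPMpy translation (illustrative data and names):
import cpmpy as cp

durations = [3, 2, 4]
start = cp.intvar(0, 20, shape=3, name="start")
makespan = cp.intvar(0, 20, name="makespan")

model = cp.Model()

# No two tasks may overlap on the single machine.
for i in range(3):
    for j in range(i + 1, 3):
        model += ((start[i] + durations[i] <= start[j]) |
                  (start[j] + durations[j] <= start[i]))

# The makespan bounds every task's finish time.
for i in range(3):
    model += makespan >= start[i] + durations[i]

model.minimize(makespan)
if model.solve():
    print("starts:", start.value(), "makespan:", makespan.value())
```

The benchmark’s question is whether an LLM, given only the prose brief (plus a well-chosen system prompt or retrieved examples), can emit a runnable model of this shape.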
Let’s get into it 👇
Contents
DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
CP-Bench: Evaluating Large Language Models for Constraint Modelling
Optimization-Free Universal Watermark Forgery with Regenerative Diffusion Models
Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks
MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models
LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models
Do Large Language Models Reason Causally Like Us? Even Better?
Banyan: Improved Representation Learning with Explicit Structure
Explaining Matters: Leveraging Definitions and Semantic Expansion for Sexism Detection
Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning
SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models
BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning
DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
Authors: Jingyu Xiao, Ming Wang, Man Ho Lam, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu
Source and references: https://arxiv.org/abs/2506.06251v1
Introduction
This paper introduces DesignBench, a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) in automated front-end code generation. The benchmark addresses key limitations of existing evaluations, such as the lack of mainstream development framework integration, insufficient task coverage, and limited evaluation dimensions.