State of AI

Benchmarking MLLMs for Front-End Code, Forging Diffusion Watermarks, and Modeling Constraints with LLMs

Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Jun 10, 2025

Welcome to today’s edition of State of AI
👋 And a warm welcome to our 84 new subscribers since the last edition!

Today’s research dives into front-end code generation, diffusion watermark forgeries, and constraint modeling with LLMs. If you’re interested in MLLMs powering UI workflows, multimodal dataset safety, or how LLMs reason about cause and effect, then this one’s for you.

Here’s what caught our attention:

  • DesignBench benchmarks MLLMs for automated front-end development across React, Vue, Angular, and HTML, finally evaluating how well these models perform in real-world generation, editing, and repair workflows.

  • CP-Bench offers the first rigorous benchmark for LLMs in constraint programming, showing how system prompts and retrieval-augmented inference can improve translation from natural language into MiniZinc, OR-Tools, and CPMpy models (see the short sketch after this list).

  • Plug-and-Plant reveals a watermark forgery method that requires no optimization and exploits regenerative diffusion models to plant convincing watermarks, challenging the reliability of current watermarking schemes.

  • Leopard introduces a vision-language model purpose-built for handling multiple text-rich images like web pages and multi-page documents, outperforming the state-of-the-art on 12 benchmarks.

  • MimeQA pushes video LLMs in a new direction by testing them on mime performances, highlighting just how poorly even advanced models understand nonverbal communication.

  • LlavaGuard builds a full framework for visual dataset moderation using open multimodal models, finally giving us a customizable safety layer for images and VLM outputs.

  • Do LLMs Reason Causally Like Us? Yes, and sometimes better. A study shows GPT-4o and Claude outperform humans in avoiding associative bias in causal reasoning, though they still fall short on subtler inference patterns.
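To give a feel for the kind of target CP-Bench asks models to produce, here is a minimal sketch (our own toy example, not taken from the benchmark) of a natural-language constraint, “choose three distinct digits from 1 to 9 that sum to 15,” expressed as a CPMpy model:

    # Hypothetical illustration of a natural-language-to-CPMpy translation target.
    from cpmpy import Model, intvar, AllDifferent

    digits = intvar(1, 9, shape=3, name="digits")  # three decision variables in 1..9

    model = Model(
        AllDifferent(digits),  # the three digits must all differ
        sum(digits) == 15,     # and they must sum to 15
    )

    if model.solve():
        print(digits.value())  # e.g. [1 5 9]

Solving the generated model and checking its solutions against the original problem description is one natural way to verify whether such a translation is faithful.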

There’s also research on robot navigation with Astra, semantic compression with BEAST, efficient model scaling with Kinetics, and safeguarding autonomous driving with SafeAuto.

Let’s get into it 👇

Contents

  1. DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

  2. CP-Bench: Evaluating Large Language Models for Constraint Modelling

  3. Optimization-Free Universal Watermark Forgery with Regenerative Diffusion Models

  4. Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

  5. MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

  6. LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models

  7. Exploring Diffusion Transformer Designs via Grafting

  8. Do Large Language Models Reason Causally Like Us? Even Better?

  9. Efficient Diffusion Models: A Survey

  10. Banyan: Improved Representation Learning with Explicit Structure

  11. Kinetics: Rethinking Test-Time Scaling Laws

  12. Explaining Matters: Leveraging Definitions and Semantic Expansion for Sexism Detection

  13. Astra: Toward General-Purpose Mobile Robots via Hierarchical Multimodal Learning

  14. SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models

  15. BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning

DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Authors: Jingyu Xiao, Ming Wang, Man Ho Lam, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu

Source and references: https://arxiv.org/abs/2506.06251v1


Introduction

This paper introduces DesignBench, a comprehensive benchmark for evaluating Multimodal Large Language Models (MLLMs) in automated front-end code generation. The benchmark addresses key limitations of existing evaluations, such as the lack of mainstream development framework integration, insufficient task coverage, and limited evaluation dimensions.
