State of AI

Bi-Weekly AI Research Roundup

Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Oct 22, 2024
Contents

  1. CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

  2. BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities

  3. Enhancing AI Accessibility in Veterinary Medicine: Linking Classifiers and Electronic Health Records

  4. IncidentResponseGPT: Generating Traffic Incident Response Plans with Generative Artificial Intelligence

  5. ChartifyText: Automated Chart Generation from Data-Involved Texts via LLM

  6. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

  7. DISCO: Efficient Diffusion Solver for Large-Scale Combinatorial Optimization Problems

  8. Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

  9. Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models

  10. Comprehensive benchmarking of large language models for RNA secondary structure prediction

  11. Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

  12. MagicPIG: LSH Sampling for Efficient LLM Generation

  13. Compute-Constrained Data Selection

  14. Diffusion Transformer Policy

  15. UADA3D: Unsupervised Adversarial Domain Adaptation for 3D Object Detection with Sparse LiDAR and Large Domain Gaps


CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Authors: Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li

Source and references: https://arxiv.org/abs/2407.01511v2


Introduction

The paper introduces CRAB, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction.

Key Points

  • CRAB provides a comprehensive framework for evaluating agents on interactive cross-environment tasks, where the agent must operate across multiple devices and platforms simultaneously.

  • CRAB introduces a novel evaluation method called the graph evaluator, which checks the intermediate procedures of completing a task by decomposing the task into multiple sub-goals.

  • CRAB proposes a highly extensible graph-based task construction method called sub-task composition, allowing for efficient construction of various cross-environment tasks with corresponding graph evaluators.

Methodology

The CRAB framework uses a unified interface for agents to operate in all environments, defining actions by their name, the environment they belong to, a concrete description of their functionality, and the parameters with descriptions. The graph evaluator decomposes tasks into a directed acyclic graph (DAG), where each node is a sub-task, and the edges represent the sequential relationship between sub-tasks.
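To make the structure concrete, here is a minimal sketch of the two ideas above, a unified action definition and a DAG-based graph evaluator. All names here (`Action`, `GraphEvaluator`, `step`) are illustrative assumptions, not CRAB's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    """One action exposed to the agent through the unified interface."""
    name: str                    # e.g. "tap"
    environment: str             # e.g. "android" or "desktop"
    description: str             # concrete description of what it does
    parameters: dict[str, str]   # parameter name -> parameter description


@dataclass
class GraphEvaluator:
    """Checks progress on a task decomposed into a DAG of sub-tasks."""
    nodes: dict[str, Callable[[dict], bool]]  # sub-task id -> state check
    edges: list[tuple[str, str]]              # (u, v): u must finish before v
    completed: set[str] = field(default_factory=set)

    def step(self, state: dict) -> float:
        """Re-check sub-tasks whose predecessors are all complete;
        return the fraction of sub-tasks finished so far."""
        for node, check in self.nodes.items():
            preds = {u for u, v in self.edges if v == node}
            if node not in self.completed and preds <= self.completed \
                    and check(state):
                self.completed.add(node)
        return len(self.completed) / len(self.nodes)
```

An evaluator of this shape naturally yields an intermediate completion ratio rather than a binary pass/fail, which is the metric reported in the results below.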

Results and Findings

Based on the CRAB framework, the authors developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. They evaluated four advanced Multimodal Language Models (MLMs) using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.

Implications and Conclusions

The CRAB framework and Crab Benchmark-v0 provide a comprehensive platform for evaluating MLM-based autonomous agents on cross-environment tasks, testing their ability to plan actions, produce valid outputs for each environment, and carry information between environments, all of which are essential for solving complex real-world problems.


BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities

Authors: Shaozhe Hao, Xuantong Liu, Xianbiao Qi, Shihao Zhao, Bojia Zi, Rong Xiao, Kai Han, Kwan-Yee K. Wong

Source and references: https://arxiv.org/abs/2410.14672v1


Introduction

This paper introduces BiGR, a novel conditional image generation model that utilizes compact binary latent codes to achieve improved performance in both generative and discriminative tasks.

Key Points

  • BiGR is the first conditional generative model that unifies generation and discrimination within the same framework.

  • BiGR features a binary tokenizer, a masked modeling mechanism, and a binary transcoder for binary code prediction.

  • The paper introduces a novel entropy-ordered sampling method to enable efficient image generation.

  • Extensive experiments validate BiGR's superior performance in generation quality and representation capabilities.

  • BiGR showcases zero-shot generalization across various vision tasks, including image inpainting, outpainting, editing, interpolation, and enrichment.

Methodology

BiGR is built upon a transformer-based language model architecture with three major components: a binary tokenizer, a decoder-only transformer with full bidirectional attention, and a binary transcoder that transforms continuous features into Bernoulli-distributed binary codes. The model is trained with a masked modeling approach: a portion of the input tokens is masked, and the model learns to predict the masked tokens.
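The sketch below shows how masked prediction over Bernoulli-distributed binary codes and the entropy-ordered sampling mentioned in the key points could fit together. The tensor shapes, the `transformer` call, and the mask handling are assumptions for illustration, not BiGR's released code.

```python
import torch


def entropy_ordered_unmask(transformer, codes, mask, steps=8):
    """Iteratively fill masked positions in a set of binary latent codes,
    committing the lowest-entropy (most confident) tokens first.

    codes: (B, N, D) binary codes in {0, 1}; masked positions hold an
           assumed placeholder value.
    mask:  (B, N) boolean, True where a token is still masked.
    Assumes every row starts with the same number of masked tokens,
    as in generation from a fully masked canvas.
    """
    while mask.any():
        remaining = int(mask.sum(dim=-1).max().item())
        # Assumed interface: the transformer returns per-bit logits, which
        # the binary transcoder head turns into Bernoulli probabilities.
        probs = torch.sigmoid(transformer(codes))                 # (B, N, D)
        # Per-token entropy, summed over the D Bernoulli bits.
        eps = 1e-6
        ent = -(probs * (probs + eps).log()
                + (1 - probs) * (1 - probs + eps).log()).sum(-1)  # (B, N)
        ent = ent.masked_fill(~mask, float("inf"))  # skip finished tokens
        # Commit a fixed share of the most confident masked tokens.
        k = max(1, remaining // steps)
        idx = ent.topk(k, dim=-1, largest=False).indices          # (B, k)
        sampled = torch.bernoulli(probs)                          # (B, N, D)
        rows = torch.arange(codes.size(0)).unsqueeze(-1)          # (B, 1)
        codes[rows, idx] = sampled[rows, idx]
        mask[rows, idx] = False
    return codes
```

Committing low-entropy tokens first means each round conditions on the bits the model is most certain about, which is what makes this style of sampling efficient relative to left-to-right autoregression.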

Results and Findings

Experiments show that BiGR significantly outperforms the latest autoregressive generation baseline LlamaGen in both generative and discriminative performance, as measured by FID, Inception Score, and linear-probe accuracy. BiGR also demonstrates faster inference speed compared to other models due to its efficient sampling strategy.

Implications and Conclusions

The findings suggest that BiGR effectively unifies generative and discriminative tasks, paving the way for further advancements in the field of conditional image generation and visual representation learning.


Enhancing AI Accessibility in Veterinary Medicine: Linking Classifiers and Electronic Health Records

Authors: Chun Yin Kong, Picasso Vasquez, Makan Farhoodimoghadam, Chris Brandt, Titus C. Brown, Krystle L. Reagan, Allison Zwingenberger, Stefan M. Keller

Source and references: https://arxiv.org/abs/2410.14625v1


Introduction

This paper presents Anna, an animal health analytics platform that facilitates the integration of multiple machine learning (ML) classifiers with electronic health record (EHR) systems. Anna enables real-time analysis of veterinary laboratory data to aid clinical decision-making.
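As a rough illustration of the linkage the paper describes, the sketch below registers classifiers against the laboratory analytes they require and runs every applicable one on an incoming record. All names and the toy rule are hypothetical, not the Anna platform's actual API.

```python
from typing import Callable

# classifier name -> (required analytes, predict function)
REGISTRY: dict[str, tuple[set[str], Callable[[dict], str]]] = {}


def register(name: str, required: set[str],
             predict: Callable[[dict], str]) -> None:
    """Attach a classifier to the analytes it needs from the EHR."""
    REGISTRY[name] = (required, predict)


def analyze(lab_record: dict[str, float]) -> dict[str, str]:
    """Run every classifier whose required analytes are present."""
    results = {}
    for name, (required, predict) in REGISTRY.items():
        if required <= lab_record.keys():
            results[name] = predict(lab_record)
    return results


# Example: a toy rule-based stand-in for a trained classifier.
register("kidney_risk", {"creatinine", "bun"},
         lambda r: "high" if r["creatinine"] > 2.0 else "low")
print(analyze({"creatinine": 2.4, "bun": 35.0}))  # {'kidney_risk': 'high'}
```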
