State of AI

Bi-Weekly AI Research Roundup

Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Oct 22, 2024
Contents

  1. CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

  2. BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities

  3. Enhancing AI Accessibility in Veterinary Medicine: Linking Classifiers and Electronic Health Records

  4. IncidentResponseGPT: Generating Traffic Incident Response Plans with Generative Artificial Intelligence

  5. ChartifyText: Automated Chart Generation from Data-Involved Texts via LLM

  6. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

  7. DISCO: Efficient Diffusion Solver for Large-Scale Combinatorial Optimization Problems

  8. Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

  9. Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models

  10. Comprehensive benchmarking of large language models for RNA secondary structure prediction

  11. Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

  12. MagicPIG: LSH Sampling for Efficient LLM Generation

  13. Compute-Constrained Data Selection

  14. Diffusion Transformer Policy

  15. UADA3D: Unsupervised Adversarial Domain Adaptation for 3D Object Detection with Sparse LiDAR and Large Domain Gaps


CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Authors: Tianqi Xu, Linyao Chen, Dai-Jie Wu, Yanjun Chen, Zecheng Zhang, Xiang Yao, Zhiqiang Xie, Yongchao Chen, Shilong Liu, Bochen Qian, Anjie Yang, Zhaoxuan Jin, Jianbo Deng, Philip Torr, Bernard Ghanem, Guohao Li

Source and references: https://arxiv.org/abs/2407.01511v2


Introduction

The paper introduces CRAB, the first agent benchmark framework designed to support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction.

Key Points

  • CRAB provides a comprehensive framework for evaluating agents on interactive cross-environment tasks, where the agent must operate across multiple devices and platforms simultaneously.

  • CRAB introduces a novel evaluation method called the graph evaluator, which checks the intermediate procedures of completing a task by decomposing the task into multiple sub-goals.

  • CRAB proposes a highly extensible graph-based task construction method called sub-task composition, allowing for efficient construction of various cross-environment tasks with corresponding graph evaluators.

Methodology

The CRAB framework uses a unified interface for agents to operate in all environments, defining actions by their name, the environment they belong to, a concrete description of their functionality, and the parameters with descriptions. The graph evaluator decomposes tasks into a directed acyclic graph (DAG), where each node is a sub-task, and the edges represent the sequential relationship between sub-tasks.
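To make the structure concrete, here is a minimal sketch of the two ideas above, a unified action definition and a DAG-based graph evaluator. All names here (`Action`, `GraphEvaluator`, `step`) are illustrative assumptions, not CRAB's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Action:
    """One action exposed to the agent through the unified interface."""
    name: str                    # e.g. "tap"
    environment: str             # e.g. "android" or "desktop"
    description: str             # concrete description of what it does
    parameters: dict[str, str]   # parameter name -> parameter description


@dataclass
class GraphEvaluator:
    """Checks progress on a task decomposed into a DAG of sub-tasks."""
    nodes: dict[str, Callable[[dict], bool]]  # sub-task id -> state check
    edges: list[tuple[str, str]]              # (u, v): u must finish before v
    completed: set[str] = field(default_factory=set)

    def step(self, state: dict) -> float:
        """Re-check sub-tasks whose predecessors are all complete;
        return the fraction of sub-tasks finished so far."""
        for node, check in self.nodes.items():
            preds = {u for u, v in self.edges if v == node}
            if node not in self.completed and preds <= self.completed \
                    and check(state):
                self.completed.add(node)
        return len(self.completed) / len(self.nodes)
```

An evaluator of this shape naturally yields an intermediate completion ratio rather than a binary pass/fail, which is the metric reported in the results below.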

Results and Findings

Based on the CRAB framework, the authors developed a cross-platform Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone environments. They evaluated four advanced Multimodal Language Models (MLMs) using different single and multi-agent system configurations on this benchmark. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.

Implications and Conclusions

The CRAB framework and Crab Benchmark-v0 provide a comprehensive platform for evaluating MLM-based autonomous agents on cross-environment tasks, testing their ability to plan actions, produce valid outputs for each environment, and carry information between environments, all of which are essential for solving complex real-world problems.


BiGR: Harnessing Binary Latent Codes for Image Generation and Improved Visual Representation Capabilities

Authors: Shaozhe Hao, Xuantong Liu, Xianbiao Qi, Shihao Zhao, Bojia Zi, Rong Xiao, Kai Han, Kwan-Yee K. Wong

Source and references: https://arxiv.org/abs/2410.14672v1


Introduction

This paper introduces BiGR, a novel conditional image generation model that utilizes compact binary latent codes to achieve improved performance in both generative and discriminative tasks.

Key Points

  • BiGR is the first conditional generative model that unifies generation and discrimination within the same framework.

  • BiGR features a binary tokenizer, a masked modeling mechanism, and a binary transcoder for binary code prediction.

  • The paper introduces a novel entropy-ordered sampling method to enable efficient image generation.

  • Extensive experiments validate BiGR's superior performance in generation quality and representation capabilities.

  • BiGR showcases zero-shot generalization across various vision tasks, including image inpainting, outpainting, editing, interpolation, and enrichment.

Methodology

BiGR is built upon a transformer-based language model architecture with three major components: a binary tokenizer, a decoder-only transformer with full bidirectional attention, and a binary transcoder that transforms continuous features into Bernoulli-distributed binary codes. The model is trained with a masked modeling approach: a portion of the input tokens is masked, and the model learns to predict the masked tokens.
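The sketch below shows how masked prediction over Bernoulli-distributed binary codes and the entropy-ordered sampling mentioned in the key points could fit together. The tensor shapes, the `transformer` call, and the mask handling are assumptions for illustration, not BiGR's released code.

```python
import torch


def entropy_ordered_unmask(transformer, codes, mask, steps=8):
    """Iteratively fill masked positions in a set of binary latent codes,
    committing the lowest-entropy (most confident) tokens first.

    codes: (B, N, D) binary codes in {0, 1}; masked positions hold an
           assumed placeholder value.
    mask:  (B, N) boolean, True where a token is still masked.
    Assumes every row starts with the same number of masked tokens,
    as in generation from a fully masked canvas.
    """
    while mask.any():
        remaining = int(mask.sum(dim=-1).max().item())
        # Assumed interface: the transformer returns per-bit logits, which
        # the binary transcoder head turns into Bernoulli probabilities.
        probs = torch.sigmoid(transformer(codes))                 # (B, N, D)
        # Per-token entropy, summed over the D Bernoulli bits.
        eps = 1e-6
        ent = -(probs * (probs + eps).log()
                + (1 - probs) * (1 - probs + eps).log()).sum(-1)  # (B, N)
        ent = ent.masked_fill(~mask, float("inf"))  # skip finished tokens
        # Commit a fixed share of the most confident masked tokens.
        k = max(1, remaining // steps)
        idx = ent.topk(k, dim=-1, largest=False).indices          # (B, k)
        sampled = torch.bernoulli(probs)                          # (B, N, D)
        rows = torch.arange(codes.size(0)).unsqueeze(-1)          # (B, 1)
        codes[rows, idx] = sampled[rows, idx]
        mask[rows, idx] = False
    return codes
```

Committing low-entropy tokens first means each round conditions on the bits the model is most certain about, which is what makes this style of sampling efficient relative to left-to-right autoregression.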

Results and Findings

Experiments show that BiGR significantly outperforms the latest autoregressive generation baseline LlamaGen in both generative and discriminative performance, as measured by FID, Inception Score, and linear-probe accuracy. BiGR also demonstrates faster inference speed compared to other models due to its efficient sampling strategy.

Implications and Conclusions

The findings suggest that BiGR effectively unifies generative and discriminative tasks, paving the way for further advancements in the field of conditional image generation and visual representation learning.


Enhancing AI Accessibility in Veterinary Medicine: Linking Classifiers and Electronic Health Records

Authors: Chun Yin Kong, Picasso Vasquez, Makan Farhoodimoghadam, Chris Brandt, Titus C. Brown, Krystle L. Reagan, Allison Zwingenberger, Stefan M. Keller

Source and references: https://arxiv.org/abs/2410.14625v1


Introduction

This paper presents Anna, an animal health analytics platform that facilitates the integration of multiple machine learning (ML) classifiers with electronic health record (EHR) systems. Anna enables real-time analysis of veterinary laboratory data to aid clinical decision-making.
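As a rough illustration of the linkage the paper describes, the sketch below registers classifiers against the laboratory analytes they require and runs every applicable one on an incoming record. All names and the toy rule are hypothetical, not the Anna platform's actual API.

```python
from typing import Callable

# classifier name -> (required analytes, predict function)
REGISTRY: dict[str, tuple[set[str], Callable[[dict], str]]] = {}


def register(name: str, required: set[str],
             predict: Callable[[dict], str]) -> None:
    """Attach a classifier to the analytes it needs from the EHR."""
    REGISTRY[name] = (required, predict)


def analyze(lab_record: dict[str, float]) -> dict[str, str]:
    """Run every classifier whose required analytes are present."""
    results = {}
    for name, (required, predict) in REGISTRY.items():
        if required <= lab_record.keys():
            results[name] = predict(lab_record)
    return results


# Example: a toy rule-based stand-in for a trained classifier.
register("kidney_risk", {"creatinine", "bun"},
         lambda r: "high" if r["creatinine"] > 2.0 else "low")
print(analyze({"creatinine": 2.4, "bun": 35.0}))  # {'kidney_risk': 'high'}
```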
