Contents
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
Certified Robustness to Data Poisoning in Gradient-Based Training
Provable acceleration for diffusion models under minimal assumptions
Conditional Forecasting of Margin Calls using Dynamic Graph Neural Networks
A Monte Carlo Framework for Calibrated Uncertainty Estimation in Sequence Prediction
BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference
MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks?
EMMA: End-to-End Multimodal Model for Autonomous Driving
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
Authors: Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan
Source and references: https://arxiv.org/abs/2410.23266v1
Introduction
This study examines the visual temporal reasoning capabilities of Multimodal Foundation Models (MFMs) in video understanding tasks, highlighting the need for a more rigorous and effective benchmark.
Key Points
Existing benchmarks often overestimate MFMs' visual temporal reasoning capabilities by allowing questions to be solved using a single frame, a few frames, or out-of-order frames.
Three principles with corresponding metrics are proposed to assess how well a benchmark evaluates visual temporal reasoning: Multi-Frame Gain, Frame Order Sensitivity, and Frame Information Disparity (an illustrative sketch of these metrics follows this list).
The authors introduce TOMATO, a novel benchmark designed to rigorously assess MFMs' temporal reasoning capabilities through carefully curated, human-annotated questions and a diverse set of videos.
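The summary does not reproduce the paper's exact formulas, but the intuition behind the three metrics can be conveyed with simple stand-ins computed from model accuracy under different frame conditions. The sketch below is illustrative only; the definitions are simplified and the accuracy inputs are hypothetical.

```python
# Illustrative sketch of benchmark-effectiveness metrics in the spirit of TOMATO.
# NOTE: these are simplified stand-ins, not the paper's exact definitions;
# all inputs (per-condition accuracies) are hypothetical.

def multi_frame_gain(acc_multi_frame: float, acc_single_frame: float) -> float:
    """Improvement when a model sees many frames instead of one.
    A benchmark solvable from a single frame yields a gain near zero."""
    return acc_multi_frame - acc_single_frame

def frame_order_sensitivity(acc_ordered: float, acc_shuffled: float) -> float:
    """Accuracy drop when frames are shuffled.
    Questions that truly require temporal reasoning should be order-sensitive."""
    return acc_ordered - acc_shuffled

def frame_information_disparity(per_frame_accs: list[float]) -> float:
    """Spread of single-frame accuracies across frames.
    A small spread means no individual frame leaks the answer more than others."""
    mean = sum(per_frame_accs) / len(per_frame_accs)
    var = sum((a - mean) ** 2 for a in per_frame_accs) / len(per_frame_accs)
    return var ** 0.5

# Example with hypothetical numbers:
print(multi_frame_gain(0.62, 0.31))          # 0.31 -> multiple frames genuinely help
print(frame_order_sensitivity(0.62, 0.40))   # 0.22 -> frame order matters
print(frame_information_disparity([0.30, 0.33, 0.29, 0.31]))  # small -> no "giveaway" frame
```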
Methodology
The authors conduct a comprehensive evaluation of 16 open-source and 7 proprietary MFMs on TOMATO, which comprises 1,484 questions spanning six temporal reasoning tasks applied to 1,417 videos, including 805 self-recorded and -generated videos.
Results and Findings
The best-performing open-source model, Qwen2-VL-72B, achieves 37.9% overall accuracy, outperforming all proprietary models, including GPT-4o at 37.7%.
However, both open-source and proprietary models remain significantly below human-level performance, which reaches 95.2% using full videos and 79.7% with 16 frames.
The analysis reveals more fundamental limitations in current MFMs, including their inability to interpret frames as a continuous sequence, over-reliance on common sense rather than visual input, and susceptibility to noisy information.
Implications and Conclusions
The authors believe TOMATO will serve as a crucial testbed for evaluating the next generation of MFMs and as a call to the community to develop AI systems capable of comprehending the dynamics of the human world through the video modality.
CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
Authors: Peng Xia, Ze Chen, Juanxi Tian, Yangrui Gong, Ruibo Hou, Yue Xu, Zhenbang Wu, Zhiyuan Fan, Yiyang Zhou, Kangyu Zhu, Wenhao Zheng, Zhaoyang Wang, Xiao Wang, Xuchao Zhang, Chetan Bansal, Marc Niethammer, Junzhou Huang, Hongtu Zhu, Yun Li, Jimeng Sun, Zongyuan Ge, Gang Li, James Zou, Huaxiu Yao
Source and references: https://arxiv.org/abs/2406.06007v2
Introduction
This paper introduces CARES, a comprehensive benchmark for evaluating the trustworthiness of Medical Large Vision Language Models (Med-LVLMs) across five key dimensions: trustfulness, fairness, safety, privacy, and robustness.
Key Points
CARES comprises about 41K question-answer pairs in both closed and open-ended formats, covering 16 medical image modalities and 27 anatomical regions.
The benchmark is designed to comprehensively assess the trustworthiness of Med-LVLMs, which have a significant impact on medical applications but raise reliability concerns.
CARES evaluates trustfulness (factuality and uncertainty estimation), fairness (across demographics), safety (jailbreaking, overcautiousness, and toxicity), privacy, and robustness (out-of-distribution).
Methodology
The authors curated CARES from seven existing medical multimodal and image-classification datasets, covering both open-ended and closed-ended question formats. The evaluation spans four open-source Med-LVLMs plus two generic LVLMs for comparison; a rough sketch of the closed-ended scoring loop follows.
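To make the evaluation setup concrete, here is a minimal sketch of how closed-ended (e.g., yes/no or multiple-choice) QA pairs might be scored per trustworthiness dimension. The record fields and the model.answer interface are assumptions for illustration, not the authors' released evaluation code.

```python
# Minimal sketch of scoring closed-ended QA pairs per trustworthiness dimension.
# The record fields and model interface are assumed, not taken from the CARES release.
from collections import defaultdict

def evaluate_closed_ended(model, records):
    """records: iterable of dicts with keys 'image', 'question', 'answer', and
    'dimension' ('trustfulness', 'fairness', 'safety', 'privacy', or 'robustness')."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        prediction = model.answer(r["image"], r["question"])  # assumed model API
        total[r["dimension"]] += 1
        if prediction.strip().lower() == r["answer"].strip().lower():
            correct[r["dimension"]] += 1
    # Per-dimension accuracy
    return {dim: correct[dim] / total[dim] for dim in total}
```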
Results and Findings
Trustfulness: Med-LVLMs exhibit significant factual hallucinations, especially on open-ended questions and rare modalities/anatomical regions. They also perform poorly at uncertainty estimation, often displaying overconfidence (a brief calibration sketch follows this list of findings).
Fairness: Model performance varies across age, gender, and racial groups, with the elderly and non-Caucasian populations showing lower accuracy.
Safety: Med-LVLMs are susceptible to jailbreaking attacks, with some models exhibiting overcautiousness and toxicity issues.
Privacy: Med-LVLMs lack effective defenses against disclosing private information and often generate fabricated private data.
Robustness: Med-LVLMs fail to recognize and handle out-of-distribution data, continuing to respond despite lacking sufficient medical knowledge.
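One standard way to quantify the overconfidence noted under trustfulness is expected calibration error (ECE), which compares a model's stated confidence with its empirical accuracy. The sketch below is a generic calibration measure, not CARES-specific code, and assumes the model reports a confidence alongside each answer.

```python
# Sketch of expected calibration error (ECE): confidences are binned and compared
# with the empirical accuracy in each bin. Inputs here are hypothetical
# (confidence, correct) pairs, not CARES outputs.

def expected_calibration_error(confidences, correct, n_bins=10):
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        avg_acc = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - avg_acc)
    return ece

# An overconfident model: high stated confidence, mediocre accuracy.
print(expected_calibration_error([0.9, 0.95, 0.85, 0.9], [1, 0, 0, 1]))  # ~0.40
```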
Implications and Conclusions
The comprehensive evaluation conducted in CARES aims to drive further standardization and the development of more reliable Med-LVLMs, as the current models exhibit significant trustworthiness concerns that could pose risks in medical applications.
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
Authors: Shenghao Xie, Wenqiang Zu, Mingyang Zhao, Duo Su, Shilong Liu, Ruohua Shi, Guoqi Li, Shanghang Zhang, Lei Ma
Source and references: https://arxiv.org/abs/2410.22217v2