State of AI

Bi-Weekly AI Research Roundup

Latest research summaries in ML, Robotics, CV, NLP and AI

State of AI
Aug 20, 2024 ∙ Paid

Contents

  1. ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

  2. xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

  3. SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation

  4. Large Language Models for Code: Security Hardening and Adversarial Testing

  5. Self-Taught Optimizer (STOP): Recursively Self-Improving Code Generation

  6. As Generative Models Improve, People Adapt Their Prompts

  7. Geometry Informed Tokenization of Molecules for Language Model Generation

  8. ARMADA: Attribute-Based Multimodal Data Augmentation

  9. LongVILA: Scaling Long-Context Visual Language Models for Long Videos

  10. SANER: Annotation-free Societal Attribute Neutralizer for Debiasing CLIP

  11. Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

  12. SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models

  13. Adaptive Draft-Verification for Efficient Large Language Model Decoding

  14. RAVEN: In-Context Learning with Retrieval-Augmented Encoder-Decoder Language Models

  15. CRITERIA: a New Benchmarking Paradigm for Evaluating Trajectory Prediction Models for Autonomous Driving

ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale

Authors: Xin Wang, Hector Delgado, Hemlata Tak, Jee-weon Jung, Hye-jin Shim, Massimiliano Todisco, Ivan Kukanov, Xuechen Liu, Md Sahidullah, Tomi Kinnunen, Nicholas Evans, Kong Aik Lee, Junichi Yamagishi

Source and references: https://arxiv.org/abs/2408.08739v1


Introduction

The ASVspoof 5 challenge promotes the study of speech spoofing and deepfake attacks and the design of detection solutions. It builds on previous challenge editions with a new database, new spoofing attacks, and new evaluation metrics.

Key Points

  • ASVspoof 5 combines logical access and speech deepfake tasks into a single challenge with two tracks: stand-alone spoofing and speech deepfake detection, and spoofing-robust automatic speaker verification (SASV).

  • The new database is built from crowdsourced data collected from a large number of speakers in diverse acoustic conditions, with attacks generated using the latest text-to-speech and voice conversion algorithms and optimized to compromise both automatic speaker verification (ASV) and countermeasure (CM) systems.

  • Adversarial attacks are introduced for the first time and combined with spoofing attacks.

  • New evaluation metrics include the minimum detection cost function (minDCF), actual DCF, log-likelihood-ratio cost (Cllr), and architecture-agnostic DCF (a-DCF).
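
To make these metrics concrete, here is a minimal NumPy sketch of how an equal error rate and a minimum detection cost can be computed from countermeasure scores. The cost weights and spoof prior below are illustrative placeholders, not the official ASVspoof 5 settings, and this is not the challenge's scoring toolkit.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate: the operating point where the miss rate on bona fide
    trials equals the false-alarm rate on spoofed trials."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    p_miss = np.array([(bonafide_scores < t).mean() for t in thresholds])
    p_fa = np.array([(spoof_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(p_miss - p_fa))
    return (p_miss[idx] + p_fa[idx]) / 2

def min_dcf(bonafide_scores, spoof_scores, c_miss=1.0, c_fa=10.0, p_spoof=0.05):
    """Minimum detection cost over all thresholds (illustrative costs and prior)."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    costs = [c_miss * (1 - p_spoof) * (bonafide_scores < t).mean()
             + c_fa * p_spoof * (spoof_scores >= t).mean()
             for t in thresholds]
    return min(costs)
```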

Methodology

The database was built in three steps, with two groups of data contributors. The first group used a partition of the Multilingual LibriSpeech dataset to build text-to-speech systems, which were then used to generate spoofed data for training surrogate ASV and CM systems. The second group used another partition to build text-to-speech and voice conversion systems, which were tuned based on the surrogate systems' performance and then used to generate spoofed data for the evaluation set.

Results and Findings

The baseline systems achieved minDCF higher than 0.7 and equal error rates (EERs) higher than 29%, indicating that the non-studio-quality data and advanced spoofing attacks posed significant challenges. However, most submissions in the closed condition outperformed the baselines, with the top-5 submissions achieving minDCF below 0.5 and EERs below 15%. Submissions in the open condition, especially those using pre-trained self-supervised learning models, performed even better. Results also revealed the importance of score calibration for practical deployment.
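
Since the summary highlights score calibration, the log-likelihood-ratio cost Cllr is worth spelling out: it penalizes scores that are poorly calibrated as log-likelihood ratios, not just poorly ranked. A minimal sketch, assuming detection scores are already expressed as LLRs with positive values favoring bona fide speech:

```python
import numpy as np

def cllr(bonafide_llrs, spoof_llrs):
    """Log-likelihood-ratio cost in bits; a well-calibrated, discriminative
    system drives this well below 1, while miscalibration inflates it even
    when the EER is low."""
    c_bona = np.mean(np.log2(1 + np.exp(-bonafide_llrs)))
    c_spoof = np.mean(np.log2(1 + np.exp(spoof_llrs)))
    return 0.5 * (c_bona + c_spoof)
```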

Implications and Conclusions

The ASVspoof 5 challenge represents a significant advancement in the study of speech spoofing and deepfake attacks, with a more realistic and challenging database and the introduction of adversarial attacks. The results demonstrate that while substantial progress has been made, there is still room for improvement, particularly in terms of score calibration, to ensure the practical deployment of effective detection solutions.


xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Authors: Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu

Source and references: https://arxiv.org/abs/2408.08872v1


Introduction

This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs.

Key Points

  • xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models.

  • The models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks.

  • The pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes.

  • A safety-tuned model trained with direct preference optimization (DPO) is introduced, aiming to mitigate harmful behaviors such as hallucinations and to improve safety.

  • The models, curated large-scale datasets, and fine-tuning codebase are open-sourced to facilitate further advancements in LMM research.

Methodology

The xGen-MM (BLIP-3) framework streamlines the model architecture by replacing the Q-Former with a more scalable vision token sampler (a perceiver resampler) and simplifying the training objectives to focus solely on the auto-regressive loss of text tokens in a multimodal context. The primary focus is on dataset curation and scaling up the training data, including the introduction of two large-scale, high-quality datasets: MINT-1T and BLIP3-KALE.
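
The perceiver-resampler idea can be sketched in a few lines: a fixed set of learned latent queries cross-attends to the variable-length vision tokens, so the language model always receives the same number of visual tokens. Below is a minimal PyTorch sketch of that mechanism; the dimensions, depth, and head counts are placeholders rather than the actual xGen-MM configuration.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch: learned latent queries cross-attend to vision tokens,
    producing a fixed number of visual tokens regardless of input length."""
    def __init__(self, dim=768, num_latents=64, num_heads=8, depth=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "ff": nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                    nn.GELU(), nn.Linear(4 * dim, dim)),
            })
            for _ in range(depth)
        ])

    def forward(self, vision_tokens):                    # (B, N, dim)
        b = vision_tokens.size(0)
        x = self.latents.unsqueeze(0).expand(b, -1, -1)  # (B, num_latents, dim)
        for layer in self.layers:
            # queries are the latents; keys/values are the image patch features
            attn_out, _ = layer["attn"](x, vision_tokens, vision_tokens)
            x = x + attn_out
            x = x + layer["ff"](x)
        return x                                         # (B, num_latents, dim)
```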

Results and Findings

The pre-trained base model exhibits strong in-context learning capabilities, outperforming previous models on various benchmarks, including OCR-related tasks. The instruction-tuned model also demonstrates competitive performance among open-source LMMs with similar model sizes. The safety-tuned model with DPO shows Pareto gains in model harmlessness and helpfulness.
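
For reference, the core of the DPO-style preference tuning mentioned above is a single loss on the log-probabilities of preferred versus dispreferred responses under the policy and a frozen reference model. This is the generic DPO objective, not necessarily the exact safety-tuning recipe used for xGen-MM:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization: encourage the policy to widen the
    log-probability margin between chosen and rejected responses relative
    to the reference model."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```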

Implications and Conclusions

By open-sourcing the xGen-MM (BLIP-3) models, curated large-scale datasets, and fine-tuning codebase, the researchers aim to make LMM research and development more accessible to the community and encourage further exploration of the potential and emergent abilities of LMMs.


SAM2-UNet: Segment Anything 2 Makes Strong Encoder for Natural and Medical Image Segmentation

Authors: Xinyu Xiong, Zihuang Wu, Shuangyi Tan, Wenxue Li, Feilong Tang, Ying Chen, Siying Li, Jie Ma, Guanbin Li

Source and references: https://arxiv.org/abs/2408.08870v1


Introduction

This research paper proposes SAM2-UNet, a framework that utilizes the Segment Anything Model 2 (SAM2) as a strong encoder for versatile image segmentation tasks.
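
Based only on the description above, the general recipe can be sketched as a frozen hierarchical image encoder (standing in for SAM2's encoder) feeding multi-scale features into a lightweight U-Net-style decoder. The encoder interface, channel counts, and decoder layout below are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class SAM2StyleUNet(nn.Module):
    """Rough sketch: a frozen hierarchical encoder (a stand-in for SAM2's image
    encoder, assumed to return four multi-scale feature maps) drives a
    U-Net-style decoder with skip connections."""
    def __init__(self, encoder, enc_channels=(96, 192, 384, 768), num_classes=1):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():      # keep the pre-trained encoder frozen
            p.requires_grad = False
        chs = list(enc_channels)
        self.up_blocks = nn.ModuleList(
            ConvBlock(chs[i] + chs[i - 1], chs[i - 1]) for i in range(len(chs) - 1, 0, -1)
        )
        self.head = nn.Conv2d(chs[0], num_classes, 1)

    def forward(self, x):
        feats = self.encoder(x)                  # assumed: [f1, f2, f3, f4], coarsest last
        y = feats[-1]
        for i, block in enumerate(self.up_blocks):
            skip = feats[-(i + 2)]
            y = nn.functional.interpolate(y, size=skip.shape[-2:],
                                          mode="bilinear", align_corners=False)
            y = block(torch.cat([y, skip], dim=1))
        return self.head(y)
```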

Keep reading with a 7-day free trial

Subscribe to State of AI to keep reading this post and get 7 days of free access to the full post archives.
