Contents
Universal Sound Separation with Self-Supervised Audio Masked Autoencoder
ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks
Long-Form Text-to-Music Generation with Adaptive Prompts: A Case of Study in Tabletop Role-Playing Games Soundtracks
Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models
ROBIN: Robust and Invisible Watermarks for Diffusion Models with Adversarial Optimization
Virchow2: Scaling Self-Supervised Mixed Magnification Models in Pathology
SynCode: LLM Generation with Grammar Augmentation
TableGPT2: A Large Multimodal Model with Tabular Data Integration
Navigating Extremes: Dynamic Sparsity in Large Output Space
Topology-guided Hypergraph Transformer Network: Unveiling Structural Insights for Improved Representation
Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent
CIBench: Evaluating Your LLMs with a Code Interpreter Plugin
OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs
Disability data futures: Achievable imaginaries for AI and disability data justice
GIS Copilot: Towards an Autonomous GIS Agent for Spatial Analysis
Universal Sound Separation with Self-Supervised Audio Masked Autoencoder
Authors: Junqi Zhao, Xubo Liu, Jinzheng Zhao, Yi Yuan, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
Source and references: https://arxiv.org/abs/2407.11745v2
Introduction
This paper proposes the use of a self-supervised pre-trained audio model, the audio masked autoencoder (A-MAE), to enhance the performance of a universal sound separation (USS) system.
Key Points
The authors integrate the self-supervised A-MAE model into a query-based USS system to obtain general audio representations.
They employ two strategies to utilize the SSL embeddings: freezing or updating the parameters of A-MAE during fine-tuning.
The SSL embeddings are concatenated with the short-time Fourier transform (STFT) features as input to the separation model.
The proposed methods are evaluated on the AudioSet dataset, and the results indicate improved separation performance compared to a state-of-the-art ResUNet-based USS model.
Methodology
The authors use a self-supervised pre-trained A-MAE model to extract universal audio features. During the USS training stage, they either freeze or partially update the parameters of the A-MAE encoder to obtain the SSL representations, which are then concatenated with the STFT features as input to the downstream separation model.
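To make the feature-concatenation idea concrete, here is a minimal PyTorch sketch. The module names, embedding dimension, frame alignment, and STFT settings are illustrative assumptions, not the paper's exact implementation; `ssl_encoder` stands in for the pre-trained A-MAE encoder and `separator` for the downstream separation network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SSLConditionedSeparator(nn.Module):
    """Sketch: concatenate (optionally frozen) SSL embeddings with STFT
    features before a downstream separation network."""

    def __init__(self, ssl_encoder, separator, n_fft=1024, hop=320, freeze_ssl=True):
        super().__init__()
        self.ssl_encoder = ssl_encoder   # placeholder for the A-MAE encoder
        self.separator = separator       # placeholder for the separation model
        self.n_fft, self.hop = n_fft, hop
        if freeze_ssl:                   # the "frozen" strategy from the paper
            for p in self.ssl_encoder.parameters():
                p.requires_grad = False

    def forward(self, mixture):          # mixture: (batch, samples)
        # Magnitude STFT features: (batch, freq_bins, frames)
        spec = torch.stft(
            mixture, n_fft=self.n_fft, hop_length=self.hop,
            window=torch.hann_window(self.n_fft, device=mixture.device),
            return_complex=True,
        ).abs()
        # SSL embeddings, assumed shape (batch, ssl_frames, ssl_dim)
        emb = self.ssl_encoder(mixture)
        # Align the SSL frame rate to the STFT frame rate before concatenation
        emb = F.interpolate(emb.transpose(1, 2), size=spec.shape[-1],
                            mode="linear", align_corners=False)
        features = torch.cat([spec, emb], dim=1)   # concatenate along feature axis
        return self.separator(features)
```

Setting `freeze_ssl=False` would correspond to the second strategy, where the A-MAE parameters are updated together with the separator during fine-tuning.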
Results and Findings
The experimental results show that the proposed frozen approach achieved an AudioSet SDRi of 5.62 dB using the average embedding, outperforming the SOTA system's 5.18 dB by 0.44 dB. The class-wise analysis reveals that the proposed method achieved an SDRi of over 15 dB on some sound classes characterized by line spectra, such as dial tone and smoke detector.
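For context on the metric, SDRi is the improvement in signal-to-distortion ratio that the separated estimate achieves over the unprocessed mixture, both measured against the reference source. A minimal sketch of the plain (non-scale-invariant) form is below; the paper's evaluation code may differ in detail.

```python
import numpy as np

def sdr(est, ref, eps=1e-8):
    """Signal-to-distortion ratio in dB (plain, non-scale-invariant form)."""
    return 10 * np.log10((np.sum(ref ** 2) + eps) / (np.sum((ref - est) ** 2) + eps))

def sdri(est, mixture, ref):
    """SDR improvement: gain of the separated estimate over the raw mixture."""
    return sdr(est, ref) - sdr(mixture, ref)
```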
Implications and Conclusions
The research demonstrates the effectiveness of integrating self-supervised pre-trained audio models, such as A-MAE, into USS systems to enhance their separation performance. The proposed approach leverages the general audio representations learned by the self-supervised model to improve the USS system's ability to separate a wide range of sound sources.
ParaGAN: A Scalable Distributed Training Framework for Generative Adversarial Networks
Authors: Ziji Shi, Jialin Li, Yang You
Source and references: https://arxiv.org/abs/2411.03999v1
Introduction
This paper introduces ParaGAN, a scalable distributed training framework for Generative Adversarial Networks (GANs) that leverages asynchronous training and an asymmetric optimization policy to accelerate GAN training.
Key Points
ParaGAN is the first distributed training framework that supports large-scale distributed training of high-resolution GANs.
ParaGAN employs a congestion-aware data pipeline and hardware-aware layout transformation to enhance accelerator utilization, yielding over 30% improvement in throughput.
ParaGAN uses an asynchronous update scheme and an asymmetric optimization policy to decouple the training of the generator and discriminator, improving the stability of large-batch training.
With ParaGAN, the training time of BigGAN can be shortened from 15 days to 14 hours on 1024 TPU accelerators at 91% scaling efficiency, and it enables direct photo-realistic image generation at an unprecedented 1024×1024 resolution.
Methodology
ParaGAN is designed with optimizations at both the system and numerical levels. On the system side, it uses a congestion-aware data pipeline and hardware-aware layout transformation to improve accelerator utilization. On the numerical side, it employs an asynchronous update scheme and an asymmetric optimization policy to stabilize the training of the generator and discriminator.
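As a rough illustration of what decoupled, asymmetric generator/discriminator optimization can look like, here is a minimal PyTorch sketch. The hinge losses, update ratio, and optimizer settings are assumptions for illustration only; ParaGAN's actual asynchronous scheme and asymmetric policy are more involved than this single-device loop.

```python
import torch

def asymmetric_gan_step(generator, discriminator, opt_g, opt_d, real, z,
                        d_steps_per_g_step=2):
    """Sketch of an asymmetric update: the discriminator takes more steps
    per iteration and uses its own optimizer and hyperparameters."""
    # Discriminator updates (possibly several per generator update)
    for _ in range(d_steps_per_g_step):
        opt_d.zero_grad()
        fake = generator(z).detach()
        d_loss = (torch.relu(1.0 - discriminator(real)).mean()
                  + torch.relu(1.0 + discriminator(fake)).mean())  # hinge loss
        d_loss.backward()
        opt_d.step()

    # Generator update with its own optimizer / hyperparameters
    opt_g.zero_grad()
    g_loss = -discriminator(generator(z)).mean()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Asymmetry can also live in the hyperparameters, e.g. different learning rates:
# opt_g = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.0, 0.999))
# opt_d = torch.optim.Adam(discriminator.parameters(), lr=4e-4, betas=(0.0, 0.999))
```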
Results and Findings
ParaGAN achieves over 30% throughput improvements compared to the baseline by using the congestion-aware data pipeline and hardware-aware layout transformation. With these system-level optimizations and the numerical optimizations, ParaGAN is able to reduce the training time of BigGAN from 15 days to 14 hours while achieving 91% scaling efficiency on 1024 TPU accelerators. Additionally, ParaGAN enables unprecedented high-resolution image generation using BigGAN, producing 1024×1024 resolution images with an Inception Score of 239.3 and Fréchet Inception Distance of 13.6.
Implications and Conclusions
ParaGAN's ability to significantly accelerate GAN training and enable high-resolution image generation has important implications for the development and deployment of powerful generative models. The techniques introduced in ParaGAN, such as the asynchronous update scheme and asymmetric optimization policy, could also be applicable to improving the training stability and scalability of other types of large-scale neural network models.
Long-Form Text-to-Music Generation with Adaptive Prompts: A Case of Study in Tabletop Role-Playing Games Soundtracks
Authors: Felipe Marra, Lucas N. Ferreira
Source and references: https://arxiv.org/abs/2411.03948v1
Introduction
This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs).