Bi-Weekly AI Research Roundup + State of AI Podcast
Latest research summaries in ML, Robotics, CV, NLP and AI
Excited to announce that all State of AI articles now include an audio podcast version, making it easier to stay informed while commuting or multitasking. Enjoy!
Contents
Recent Advances in Attack and Defense Approaches of Large Language Models
LUMIA: Linear probing for Unimodal and MultiModal Membership Inference Attacks leveraging internal LLM states
InvDesFlow: An AI search engine to explore possible high-temperature superconductors
Efficient Multi-modal Large Language Models via Visual Token Grouping
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
Free-Mask: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing to Improve Segmentation Ability
BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion
Constraining Generative Models for Engineering Design with Negative Data
Topology-Based Reconstruction Prevention for Decentralised Learning
Scaling Speech-Text Pre-training with Synthetic Interleaved Data
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning
CREW: Facilitating Human-AI Teaming Research
Inclusive Design of AI's Explanations: Just for Those Previously Left Out, or for Everyone?
ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition
Recent Advances in Attack and Defense Approaches of Large Language Models
Authors: Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
Source and references: https://arxiv.org/abs/2409.03274v3
Introduction
This paper reviews the current state of research on attack and defense approaches targeting Large Language Models (LLMs). LLMs have revolutionized artificial intelligence, but their widespread deployment has raised significant safety and reliability concerns.
Key Points
Explores vulnerabilities inherent in LLMs, including those arising from overfitting, fine-tuning, quantization, and reinforcement learning from human feedback (RLHF)
Categorizes and analyzes emerging attack methods, including post-training attacks on fine-tuning and RLHF, as well as adversarial attacks like jailbreaks and prompt injection
Examines current defense strategies and highlights their limitations, proposing future research directions to enhance LLM security
Methodology
The paper surveys the latest research on LLM vulnerabilities, attack methods, and defense strategies. It provides a comprehensive overview of the current landscape, linking attacks to identified vulnerabilities and analyzing the strengths and weaknesses of contemporary defenses.
Results and Findings
The analysis reveals that LLMs are susceptible to a variety of attacks, including backdoor injections, data poisoning, and adversarial prompts that can bypass safety mechanisms. Current defense approaches often struggle to keep pace with the evolving attack methods, particularly in areas like RLHF and supply chain vulnerabilities.
Implications and Conclusions
Understanding the current state of LLM attacks and defenses is crucial for developing more robust security measures. This survey aims to guide future research and help the community navigate the evolving landscape of LLM safety and reliability challenges.
LUMIA: Linear probing for Unimodal and MultiModal Membership Inference Attacks leveraging internal LLM states
Authors: Luis Ibanez-Lissen, Lorena Gonzalez-Manzano, Jose Maria de Fuentes, Nicolas Anciaux, Joaquin Garcia-Alfaro
Source and references: https://arxiv.org/abs/2411.19876v2
Introduction
This paper proposes a novel approach called LUMIA to detect Membership Inference Attacks (MIAs) on Large Language Models (LLMs) by analyzing their internal activations using Linear Probes (LPs).
Key Points
LUMIA applies LPs layer-by-layer to obtain fine-grained data on the inner workings of LLMs and assess their vulnerability to MIAs.
LUMIA is evaluated across several model architectures, sizes, and datasets, including unimodal and multimodal tasks.
In unimodal MIA, LUMIA achieves an average gain of 15.71% in Area Under the Curve (AUC) over previous techniques.
LUMIA reaches AUC>60% in 65.33% of cases, an increment of 46.80% against the state of the art.
LUMIA's approach reveals key insights, such as the model layers where MIAs are most detectable.
In multimodal models, LPs indicate that visual inputs can contribute significantly to detecting MIAs, reaching AUC>60% in 85.90% of experiments.
Methodology
LUMIA formulates the problem of membership inference using internal activations and trains LP classifiers on these activations to assess the distribution of membership information across the model's layers. This approach is applied to both unimodal and multimodal LLMs.
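To make the layer-wise probing idea concrete, here is a minimal sketch of training one linear probe per layer and scoring how detectable membership is at each depth. It is an illustration, not the authors' code; the hidden-state dictionary, layer count, and sklearn-based probe are assumptions.

```python
# Sketch: layer-wise linear probes for membership inference (illustrative only).
# Assumes hidden_states[layer] holds activations of shape (n_samples, dim) and
# is_member marks whether each sample was in the model's training data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def probe_layers(hidden_states, is_member):
    """Train one linear probe per layer; return per-layer AUC for MIA detectability."""
    aucs = {}
    for layer, acts in hidden_states.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            acts, is_member, test_size=0.3, random_state=0, stratify=is_member)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        aucs[layer] = roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])
    return aucs  # layers with AUC > 0.6 indicate where membership signal concentrates

# Toy usage with random activations for a 4-layer model
rng = np.random.default_rng(0)
hidden_states = {l: rng.normal(size=(200, 64)) for l in range(4)}
is_member = rng.integers(0, 2, size=200)
print(probe_layers(hidden_states, is_member))
```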
Results and Findings
LUMIA significantly outperforms previous MIA techniques, achieving an average gain of 15.71% in AUC for unimodal models and reaching AUC>60% in 65.33% of cases - a 46.80% improvement over the state of the art. In multimodal models, LUMIA shows that visual inputs can contribute substantially to detecting MIAs, reaching AUC>60% in 85.90% of experiments.
Implications and Conclusions
This research demonstrates the effectiveness of leveraging internal model activations, through the use of linear probes, for assessing membership inference attacks on large language and multimodal models. The insights gained can inform the development of more transparent and accountable AI systems.
InvDesFlow: An AI search engine to explore possible high-temperature superconductors
Authors: Xiao-Qi Han, Zhenfeng Ouyang, Peng-Jie Guo, Hao Sun, Ze-Feng Gao, Zhong-Yi Lu
Source and references: https://arxiv.org/abs/2409.08065v2
Introduction
The paper presents InvDesFlow, an AI search engine developed to accelerate the discovery of high-temperature superconducting materials.
Key Points
InvDesFlow integrates deep model pre-training and fine-tuning techniques, diffusion models, and physics-based approaches to search for high-Tc superconductors.
Utilizing InvDesFlow, the authors obtained 74 dynamically stable materials with critical temperatures predicted to be Tc ≥ 15 K, which are not contained in any existing dataset.
The paper analyzes trends in the dataset and highlights the properties of specific materials, such as B4CN3 (Tc=24.08 K) and B5CN2 (Tc=15.93 K).
The AI search engine's flexibility allows it to be tailored for the discovery of various functional materials with targeted properties, expanding its utility across materials science.
Methodology
The InvDesFlow approach consists of several key components: a symmetry-constrained crystal generation model based on diffusion generative models and graph neural networks, a superconducting classification model utilizing pre-training and fine-tuning techniques, a formation energy prediction model, and the ALIGNN model for Tc prediction. These components are integrated to generate, screen, and validate potential high-Tc superconducting materials.
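To give a rough feel for how these components fit together, here is a hedged pseudocode sketch of the generate-screen-validate loop. The function and method names are hypothetical stand-ins; in the actual system each step is a separately trained model (diffusion generator, superconductivity classifier, formation-energy predictor, ALIGNN-style Tc regressor).

```python
# Sketch of the generate -> screen -> validate loop (hypothetical interfaces).
def search_candidates(generator, sc_classifier, e_form_model, tc_model,
                      n_samples=1000, tc_threshold=15.0):
    candidates = []
    for structure in generator.sample(n_samples):          # diffusion-based crystal generation
        if not sc_classifier.is_superconductor(structure): # pre-trained/fine-tuned classifier
            continue
        if e_form_model.formation_energy(structure) > 0.0: # keep thermodynamically plausible ones
            continue
        tc = tc_model.predict_tc(structure)                # ALIGNN-style Tc prediction
        if tc >= tc_threshold:                             # e.g. Tc >= 15 K
            candidates.append((structure, tc))
    return candidates  # survivors still need first-principles validation
```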
Results and Findings
By utilizing InvDesFlow, the authors obtained 74 dynamically stable materials with critical temperatures predicted by the AI model to be Tc ≥ 15 K. The paper provides a detailed analysis of two materials, B4CN3 (Tc=24.08 K) and B5CN2 (Tc=15.93 K), demonstrating their structural, electronic, and superconducting properties through first-principles calculations.
Implications and Conclusions
The development of InvDesFlow represents a significant advancement in the quest for high-Tc superconductors, as it is capable of discovering new crystal structures not present in existing databases. The researchers emphasize the adaptability of the AI search engine, which can be tailored for the discovery of various functional materials with targeted properties, thereby greatly expanding its utility across the field of materials science.
Efficient Multi-modal Large Language Models via Visual Token Grouping
Authors: Minbin Huang, Runhui Huang, Han Shi, Yimeng Chen, Chuanyang Zheng, Xiangguo Sun, Xin Jiang, Zhenguo Li, Hong Cheng
Source and references: https://arxiv.org/abs/2411.17773v2
Introduction
This paper introduces VisToG, a novel grouping mechanism to improve the efficiency of Multi-modal Large Language Models (MLLMs) by reducing the computational cost associated with processing high-resolution visual inputs.
Key Points
VisToG leverages pre-trained vision encoders to group similar image segments without the need for segmentation masks, reducing the number of visual tokens processed by the language model.
VisToG employs an isolated attention mechanism to preserve the integrity of the original image representations while allowing the semantic tokens to specialize in creating meaningful groupings.
The authors propose a two-stage training pipeline to first align the image features with the language model, followed by fine-tuning the model for instruction-aware visual token grouping.
Extensive experiments demonstrate that VisToG can maintain 98.1% of the original performance while reducing inference time by over 27%.
VisToG outperforms existing methods for visual token reduction, such as Q-Former and adaptive average pooling, across various downstream tasks.
Methodology
VisToG introduces a novel grouping mechanism that leverages pre-trained Vision Transformers to cluster similar image segments into semantically related concepts. This approach aims to eliminate the need to encode redundant vision tokens, thereby optimizing computational efficiency. The authors also employ an isolated attention mechanism to prevent the original image tokens from interacting directly with the newly added semantic tokens, preserving the integrity of the image representations.
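One way to picture the "isolated attention" constraint is as an attention mask in which the added semantic (group) tokens may attend to image tokens, but image tokens never attend back to them. The sketch below is an assumption about how such a mask could be built; token counts are illustrative.

```python
# Sketch of an isolated-attention mask (assumed implementation, not the authors' code):
# group tokens can read from image tokens, image tokens cannot read from group tokens.
import torch

def isolated_attention_mask(num_image_tokens: int, num_group_tokens: int) -> torch.Tensor:
    n = num_image_tokens + num_group_tokens
    mask = torch.zeros(n, n, dtype=torch.bool)              # True = attention blocked
    mask[:num_image_tokens, num_image_tokens:] = True       # image queries -> group keys blocked
    return mask  # usable as attn_mask in torch.nn.MultiheadAttention

mask = isolated_attention_mask(num_image_tokens=576, num_group_tokens=64)
print(mask.shape, mask[:576, 576:].all().item())            # torch.Size([640, 640]) True
```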
Results and Findings
The authors conduct extensive experiments on a range of datasets, including GQA, TextVQA, POPE, and others. VisToG consistently outperforms existing methods, such as LLaVA-rand and LLaVA-AvgPool, across different token counts. For example, VisToG maintains relatively high performance even as the number of tokens decreases, while the performance of the other methods degrades sharply when the token count falls below 64.
Implications and Conclusions
The proposed VisToG mechanism represents a significant advancement in improving the efficiency of Multi-modal Large Language Models. By effectively reducing the computational costs associated with processing high-resolution visual inputs, VisToG enables the broader deployment of MLLMs in resource-constrained environments. This research provides valuable insights for training larger MLLMs with minimal image token redundancies, ultimately enhancing the capabilities and practicality of these powerful models.
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
Authors: M. Jehanzeb Mirza, Mengjie Zhao, Zhuoyuan Mao, Sivan Doveh, Wei Lin, Paul Gavrikov, Michael Dorkenwald, Shiqi Yang, Saurav Jha, Hiromi Wakaki, Yuki Mitsufuji, Horst Possegger, Rogerio Feris, Leonid Karlinsky, James Glass
Source and references: https://arxiv.org/abs/2410.06154v2
Introduction
This paper proposes GLOV (Guided Large Language Models as Implicit Optimizers for Vision Language Models), a method that lets Large Language Models (LLMs) act as implicit optimizers for Vision-Language Models (VLMs), enhancing downstream vision tasks.
Key Points
GLOV meta-prompts an LLM with the downstream task description, querying it for suitable VLM prompts (e.g., for zero-shot classification with CLIP).
The prompts are ranked according to their fitness for the downstream vision task, and the ranked prompts are fed as in-context examples (with their accuracies) to equip the LLM with the knowledge of the type of prompts preferred by the downstream VLM.
GLOV explicitly steers the LLM's generation at each optimization step: an offset vector, computed as the difference between embeddings of the positive and negative solutions found in previous steps, is added to an intermediate layer of the network for the next generation step.
This offset vector steers the LLM generation toward the type of language preferred by the downstream VLM, resulting in enhanced performance on the downstream vision tasks.
Methodology
The paper frames the problem of finding suitable natural language prompts for VLMs as an optimization problem, with the objective of improving performance on downstream vision tasks. GLOV employs a prompt search technique that relies on a meta-prompt coupled with embedding-space guidance to drive the prompt optimization for the VLMs.
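The embedding-space guidance can be pictured as a steering vector applied during generation. Here is a minimal sketch under stated assumptions: the positive/negative embeddings, the hook mechanics, and the scaling factor are illustrative, not the paper's exact procedure.

```python
# Sketch: steer an LLM's generation by adding alpha * (h_pos - h_neg) to an
# intermediate layer's hidden states (assumed implementation).
import torch

def steering_hook(h_pos: torch.Tensor, h_neg: torch.Tensor, alpha: float = 1.0):
    """Return a forward hook that shifts a layer's hidden states by the offset vector."""
    offset = alpha * (h_pos - h_neg)              # shape: (hidden_dim,)
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + offset                   # broadcasts over batch and sequence
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage with a Hugging Face-style decoder:
# model.model.layers[k].register_forward_hook(steering_hook(h_pos, h_neg, alpha=0.5))
# model.generate(...)  # next prompt candidates are nudged toward VLM-preferred language
```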
Results and Findings
GLOV is comprehensively evaluated on 16 diverse datasets using two families of VLMs, i.e., dual-encoder (e.g., CLIP) and encoder-decoder (e.g., LLaVA) models. The discovered solutions can enhance recognition performance by up to 15.0% and 57.5% (3.8% and 21.6% on average) for these two model families, respectively.
Implications and Conclusions
The proposed GLOV method enables LLMs to act as implicit optimizers for VLMs, leading to significant improvements in downstream vision tasks without requiring any gradient-based learning or parameter updates.
Free-Mask: A Novel Paradigm of Integration Between the Segmentation Diffusion Model and Image Editing to Improve Segmentation Ability
Authors: Bo Gao, Fangxu Xing, Daniel Tang
Source and references: https://arxiv.org/abs/2411.01819v2
Introduction
This research paper proposes Free-Mask, a novel framework that combines a diffusion model for semantic segmentation with advanced image editing capabilities to automatically generate diverse synthetic datasets with precise segmentation masks. This approach aims to overcome the limitations of existing methods that struggle with generating multi-instance images or achieving accurate mask generation.
Key Points
Reverses the traditional segmentation task from image to mask, generating images from precise masks to enrich the data domain and enhance performance in open-world scenarios.
Generates both single-instance and complex multi-instance images with accurate masks, applying mixing and filtering to better match real-world scenarios.
Presents the Free-Mask framework, which couples the segmentation diffusion model with image editing techniques to handle diverse inputs and allow for various edits.
Methodology
The proposed method consists of two main steps. First, it generates images with a single object per image and their corresponding masks using cross-attention maps from a diffusion model. Then, it proceeds to the image editing phase to address challenges such as determining appropriate objects to add, placing them within the scene, and harmonizing the foreground and background.
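For intuition on the first step, a common way to derive a mask from a diffusion model's cross-attention is to average the attention maps for the target word, upsample, and threshold. The sketch below illustrates that generic recipe; it is an assumption, not the authors' exact extraction procedure.

```python
# Illustrative sketch: binary mask from cross-attention maps of a target token.
import torch
import torch.nn.functional as F

def mask_from_cross_attention(attn_maps, image_size=(512, 512), threshold=0.5):
    """attn_maps: list of (heads, h*w) attention weights for the target class token."""
    maps = []
    for a in attn_maps:
        heads, hw = a.shape
        side = int(hw ** 0.5)
        m = a.mean(dim=0).reshape(1, 1, side, side)           # average over heads
        maps.append(F.interpolate(m, size=image_size,
                                  mode="bilinear", align_corners=False))
    avg = torch.cat(maps, dim=0).mean(dim=0)[0]               # average over timesteps/layers
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-8)  # normalise to [0, 1]
    return (avg > threshold).float()                          # binary segmentation mask
```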
Results and Findings
The authors conduct extensive experiments on the PASCAL VOC 2012 and Cityscapes datasets. Their results demonstrate that synthetic data generated by Free-Mask enables segmentation models to outperform those trained on real data, especially in zero-shot settings. Free-Mask achieves new state-of-the-art results on previously unseen classes in the VOC 2012 benchmark.
Implications and Conclusions
The proposed Free-Mask framework represents a significant advancement in the field of semantic segmentation, offering an efficient and effective way to generate diverse synthetic datasets with precise segmentation masks. This approach has the potential to reduce the burden of manual annotation and enhance the performance of segmentation models in open-world scenarios.
BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion
Authors: Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi
Source and references: https://arxiv.org/abs/2305.15798v4
Introduction
This paper focuses on compressing Stable Diffusion models (SDMs) for efficient text-to-image (T2I) generation through architectural modifications and distillation-based retraining.
Key Points
Achieved up to 51% reduction in model size and 43% improvement in latency by removing multiple residual and attention blocks from the U-Net in SDMs.
Demonstrated the notable benefit of feature distillation for training diffusion models, enabling competitive T2I performance with significantly fewer resources (13 A100 days and 0.22M LAION pairs) compared to training SDMs from scratch (over 6,000 A100 days and 2,000M pairs).
Showed the practicality of the compact models across various aspects, including personalized generation, image-to-image translation, and mobile deployment.
Publicly released the approach, model weights, and source code, motivating subsequent works by other researchers.
Methodology
The authors compress the U-Net in SDMs by removing architectural blocks, including fewer blocks in the down and up stages, removing the entire mid-stage, and further pruning the innermost stages. They then retrain the compact models using feature-level knowledge distillation to mimic the behavior of the original SDMs.
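The retraining objective combines the usual denoising loss with output- and feature-level distillation from the original SDM. The sketch below shows that combination in generic form; the loss weights and the specific feature tap points are assumptions rather than the paper's reported values.

```python
# Sketch of a combined task + output-KD + feature-KD loss (weights are illustrative).
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, student_feats, teacher_feats,
                      noise_target, w_task=1.0, w_out=1.0, w_feat=1.0):
    loss_task = F.mse_loss(student_out, noise_target)          # standard denoising objective
    loss_out = F.mse_loss(student_out, teacher_out.detach())   # output-level distillation
    loss_feat = sum(F.mse_loss(s, t.detach())                  # feature-level distillation
                    for s, t in zip(student_feats, teacher_feats))
    return w_task * loss_task + w_out * loss_out + w_feat * loss_feat
```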
Results and Findings
Despite using far fewer training resources, the authors' compact models, named BK-SDMs, achieve competitive zero-shot performance on MS-COCO compared to large-scale baselines. The distillation-based retraining is crucial, enabling the compact models to effectively learn from the original SDMs. The authors also demonstrate the applicability of BK-SDMs in personalized generation and image-to-image translation tasks, as well as their efficiency in mobile deployment (less than 4 seconds inference time on Jetson AGX Orin and iPhone 14).
Implications and Conclusions
The authors' work highlights the surprising potential of classical architectural compression techniques for building efficient and capable diffusion models, which can be further combined with other optimization methods. Their publicly released code and models can facilitate subsequent research on structural compression of large diffusion models.
Constraining Generative Models for Engineering Design with Negative Data
Authors: Lyle Regenwetter, Giorgio Giannone, Akash Srivastava, Dan Gutfreund, Faez Ahmed
Source and references: https://arxiv.org/abs/2306.15166v2
Introduction
This paper introduces a novel training method for generative models to guide them towards constraint-satisfying outputs using 'negative data' - examples of what to avoid. The authors present a new negative-data generative model (NDGM) formulation that outperforms classic models, generating significantly fewer constraint-violating samples using much less data.
Key Points
The authors introduce a new NDGM formulation that overcomes issues in the prior state-of-the-art method, such as lack of density ratio learning and mode collapse.
Their model is extensively benchmarked across numerous synthetic tests and real engineering problems, demonstrating best-in-class performance and overall dominance over classic generative models.
NDGMs can generate 1/6 as many constraint-violating samples using 1/8 as much data compared to vanilla models in certain problems.
The authors' benchmarks showcase the potency of their new NDGM formulation and the overall advantages of NDGMs versus classic generative models.
The authors publicly release the code and benchmarks for their work.
Methodology
The authors propose a new NDGM training formulation that introduces two key innovations: 1) Learning individual density ratios between the positive, negative, and fake distributions using a multi-class discriminator, rather than conflating negatives and fakes. 2) Adding a Determinantal Point Process-based diversity loss to mitigate mode collapse issues in NDGMs.
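To make the first innovation concrete, here is one way the multi-class discriminator could look: instead of lumping negatives and fakes into a single "not real" class, the discriminator separates positive, negative, and generated samples, so the ratios between the three distributions can be read from its softmax outputs. The losses below are a hedged sketch, not the authors' exact formulation (and the diversity loss is omitted).

```python
# Sketch: three-class discriminator for positive / negative / generated samples.
import torch
import torch.nn.functional as F

def discriminator_loss(logits_pos, logits_neg, logits_fake):
    """Cross-entropy over three classes: real-positive (0), real-negative (1), generated (2)."""
    loss = (F.cross_entropy(logits_pos, torch.zeros(len(logits_pos), dtype=torch.long)) +
            F.cross_entropy(logits_neg, torch.ones(len(logits_neg), dtype=torch.long)) +
            F.cross_entropy(logits_fake, torch.full((len(logits_fake),), 2, dtype=torch.long)))
    return loss / 3

def generator_loss(logits_fake):
    """Push generated samples toward the positive class, away from negatives and fakes."""
    p = F.softmax(logits_fake, dim=-1)
    return -(torch.log(p[:, 0] + 1e-8)).mean()
```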
Results and Findings
The authors' extensive benchmarking demonstrates that their GAN-MDD model frequently achieves an unmatched tradeoff between constraint satisfaction and distributional similarity, outperforming both baseline models and the current state-of-the-art NDGMs. In some cases, GAN-MDD attains 95-98% lower constraint violation than classic generative models while achieving top-three distributional similarity scores.
The authors also show that NDGMs can be significantly more data-efficient than vanilla models, generating 1/6 as many constraint-violating samples using only 1/8 as much data in certain problems.
Implications and Conclusions
The authors' new NDGM formulation offers significantly improved constraint satisfaction over classic generative models, making it a valuable tool for engineering design and other domains with strict physical or safety-critical constraints. The widespread superiority of their approach, as demonstrated across numerous benchmarks, suggests that NDGMs are an underutilized but powerful technique that can greatly benefit constrained generative modeling applications.
Topology-Based Reconstruction Prevention for Decentralised Learning
Authors: Florine W. Dekker, Zekeriya Erkin, Mauro Conti
Source and references: https://arxiv.org/abs/2312.05248v3
Introduction
This research paper analyzes reconstruction attacks in decentralized learning, where data and coordination are distributed across users. The authors show that passive honest-but-curious adversaries can infer other users' private data after multiple privacy-preserving summations, even without exploiting the inner workings of the summation protocol or having auxiliary knowledge.
Key Points
Passive honest-but-curious adversaries can reconstruct private data in decentralized learning, with a success rate that depends only on their local neighborhood, not the full network size.
Reconstruction attacks require a number of adversaries that is linear in the length of the network's shortest cycle (girth).
Exact reconstruction attacks are impossible in acyclic networks, regardless of the number of adversaries.
The authors propose a topology-based decentralized defense against reconstruction attacks, by restricting how summations can be composed.
Methodology
The authors model the network as an undirected graph, with users as nodes and communication links as edges. They allow privacy-preserving summation over direct neighbors, and consider a set of colluding adversaries that are honest-but-curious.
Results and Findings
The authors show that in random peer-to-peer subgraphs, three adversaries with 15 neighbors can successfully reconstruct at least one neighbor's private datum 11.0% of the time, requiring an average of 8.8 summations per adversary. They prove that the success rate depends only on the adversaries' local neighborhood, not the full network size.
The authors further show that reconstruction corresponds to cycles in the graph: If the graph's shortest cycle has length 2k, then reconstruction never succeeds if there are fewer than k adversaries.
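A toy way to see why cycles matter: each summation an adversary observes is a linear equation over the honest participants' private values, and enough equations along a cycle make the system solvable. The example below is a simplification for illustration only, not the paper's construction.

```python
# Toy illustration: summations as linear equations over honest users' private data.
import numpy as np

x = np.array([3.0, 5.0, 7.0])        # private data of honest users h0, h1, h2 (unknown to adversaries)

A = np.array([[1, 1, 0],             # one observed summation over {h0, h1}
              [0, 1, 1],             # one over {h1, h2}
              [1, 0, 1]])            # one over {h0, h2} -> together these form a cycle
b = A @ x                            # observed sums, adversaries' own inputs already subtracted

# Three independent equations (a cycle through all three users) make the private
# data exactly recoverable; with fewer, the system stays under-determined.
print(np.linalg.solve(A, b))         # [3. 5. 7.]
```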
Implications and Conclusions
This work takes a step towards a formal theory of topology-based decentralized reconstruction defenses. Such a theory could generalize the authors' countermeasure beyond summation, define confidentiality in terms of entropy, and describe the interactions with (topology-aware) differential privacy.
Scaling Speech-Text Pre-training with Synthetic Interleaved Data
Authors: Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang
Source and references: https://arxiv.org/abs/2411.17607v2
Introduction
This paper presents a novel approach for scaling speech-text pre-training by leveraging synthetic interleaved data derived from text corpora, addressing the data limitation challenges in traditional speech-text pre-training.
Key Points
Propose a method to efficiently synthesize high-quality interleaved speech-text data from text corpora, eliminating the need for parallel speech-text datasets.
Design a SpeechLM architecture with a supervised 12.5Hz speech tokenizer and a flow-matching based decoder, achieving both robust semantic preservation and high-quality speech synthesis.
Scale the pre-training to 1 trillion tokens using the synthesized interleaved speech-text data, significantly advancing capabilities in speech language modeling and spoken question answering.
Develop an end-to-end spoken chatbot by fine-tuning the pre-trained model with speech dialogue data, achieving competitive performance in conversational abilities and speech quality.
Methodology
The authors train a text-to-token model to convert text into corresponding speech tokens, and then efficiently construct interleaved speech-text data by sampling text spans from existing text corpora and transforming them into speech spans using the trained text-to-token model. They also employ a supervised speech tokenizer derived from an ASR model to enable discrete speech tokens with strong semantic preservation.
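The data construction can be pictured as randomly swapping text spans for their synthesized speech-token equivalents. The sketch below uses hypothetical helper names; the real pipeline relies on the trained text-to-token model and the supervised 12.5 Hz speech tokenizer described above.

```python
# Sketch: build one synthetic interleaved speech-text example from a text sequence.
import random

def make_interleaved_example(text_tokens, text_to_speech_tokens,
                             span_prob=0.3, min_span=5, max_span=30):
    """Randomly replace text spans with speech tokens produced by a text-to-token model."""
    out, i = [], 0
    while i < len(text_tokens):
        if random.random() < span_prob:
            j = min(i + random.randint(min_span, max_span), len(text_tokens))
            out.extend(text_to_speech_tokens(text_tokens[i:j]))  # synthesized speech-token span
            i = j
        else:
            out.append(text_tokens[i])                           # keep the text token as-is
            i += 1
    return out
```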
Results and Findings
The experiments show that scaling the synthetic interleaved data from 0B to 600B tokens significantly improves the model's performance on speech language modeling and spoken question answering tasks, surpassing previous state-of-the-art results. The authors also demonstrate that by fine-tuning the pre-trained model with speech dialogue data, they can develop an end-to-end spoken chatbot that achieves competitive performance compared to existing baselines.
Implications and Conclusions
The proposed approach effectively addresses the data limitation challenges in traditional speech-text pre-training by leveraging large-scale text corpora to synthesize interleaved speech-text data. The resulting speech language models exhibit significant advancements in speech-related capabilities, paving the way for more natural and intuitive human-AI interaction through voice-based interfaces.
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Authors: Ezra Karger, Houtan Bastani, Chen Yueh-Han, Zachary Jacobs, Danny Halawi, Fred Zhang, Philip E. Tetlock
Source and references: https://arxiv.org/abs/2409.19839v3
Introduction
This paper introduces ForecastBench, a dynamic benchmark that evaluates the accuracy of machine learning (ML) systems on an automatically generated and regularly updated set of 1,000 forecasting questions about future events with no known answers at the time of submission.
Key Points
ForecastBench is a dynamic benchmark that continuously updates with new forecasting questions from various data sources.
The benchmark avoids any possibility of data leakage by consisting solely of questions about future events.
The authors quantify the capabilities of current ML systems by collecting forecasts from expert (human) forecasters, the general public, and large language models (LLMs).
Expert forecasters outperform the top-performing LLM, despite LLMs achieving super-human performance on many other benchmarks.
The authors display system and human scores in a public leaderboard at www.forecastbench.org.
Methodology
The authors introduce seven baseline approaches for evaluating LLM forecasting performance, including zero-shot prompting, prompting with scratchpad instructions, and prompting with scratchpad instructions and retrieved news articles. They also incorporate crowd forecasts as additional inputs to the LLM models.
Results and Findings
The results show that the median public survey participant had an overall Brier score of 0.111, while the median superforecaster participant had a Brier score of 0.091 on the 200-item subset of the benchmark. In comparison, the top-performing LLM (Claude-3.5 Sonnet) had a Brier score of 0.114, significantly underperforming the expert human forecasters.
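For readers unfamiliar with the metric, the Brier score is simply the mean squared error between probabilistic forecasts and binary outcomes, so lower is better; a quick check:

```python
# Brier score: mean squared error between forecast probabilities and 0/1 outcomes.
def brier_score(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier_score([0.7, 0.2, 0.9], [1, 0, 1]))  # 0.0466...
```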
Implications and Conclusions
The findings from ForecastBench leave significant room for researchers to attempt to improve AI-based forecasting systems using innovative approaches, such as developing methods for continuously updating models with current events and enhancing LLMs to reason over extended time frames. The authors publish an auxiliary dataset of LLM and human forecasts, rationales, and accuracy for use in future LLM fine-tuning and testing.
MLLM-LLaVA-FL: Multimodal Large Language Model Assisted Federated Learning
Authors: Jianyi Zhang, Hao Frank Yang, Ang Li, Xin Guo, Pu Wang, Haiming Wang, Yiran Chen, Hai Li
Source and references: https://arxiv.org/abs/2409.06067v2
Introduction
This paper introduces a novel federated learning framework called Multimodal Large Language Model Assisted Federated Learning (MLLM-LLaVA-FL), which leverages the power of multimodal large language models (MLLMs) to address the challenges of data heterogeneity and long-tailed distributions in federated learning.
Key Points
The framework integrates MLLMs as auxiliary tools to harness the extensive open-source data available on the internet and leverage the server-side computational capabilities.
It employs a three-stage approach: global multimodal pretraining, federated local finetuning, and global alignment.
The global multimodal pretraining stage utilizes MLLMs to generate detailed descriptions for unlabeled online data, which is then used to pretrain the federated learning model.
The federated local finetuning stage allows the integration of various existing federated learning methods, while the global alignment stage refines the model's outputs under MLLM supervision to mitigate the effects of long-tailed distributions.
Compared to previous approaches, the framework enhances privacy protection and reduces the computational burden on client devices.
Methodology
The framework consists of three key stages, sketched in pseudocode after the list:
Global Multimodal Pretraining: The authors leverage MLLMs, such as GPT-4, to generate detailed descriptions, complex reasoning, and conversational data from unlabeled online images. This pretraining data is then used to train the federated learning model in a dynamic weighted manner, gradually increasing the influence of the compact federated learning model.
Federated Local Finetuning: The pretrained federated learning model is distributed to clients for local training, which can be done using various existing federated learning methods.
Global Alignment: After aggregation, the federated learning model is further refined on the server side using an alignment dataset and a global alignment function, which can be designed to address challenges like long-tailed distributions.
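Here is a minimal pseudocode sketch of how the three stages above could be wired together. The helper names (describe, pretrain, local_finetune, aggregate, align) are hypothetical placeholders for the framework's actual components.

```python
# Pseudocode sketch of MLLM-assisted federated learning (hypothetical interfaces).
def mllm_assisted_fl(server_model, clients, mllm, unlabeled_web_images,
                     alignment_data, rounds=10):
    # Stage 1: global multimodal pretraining on MLLM-generated descriptions
    pretrain_data = [(img, mllm.describe(img)) for img in unlabeled_web_images]
    server_model.pretrain(pretrain_data)

    for _ in range(rounds):
        # Stage 2: federated local finetuning with any standard FL method
        updates = [client.local_finetune(server_model) for client in clients]
        server_model.aggregate(updates)

        # Stage 3: global alignment on the server, e.g. to counter long-tailed labels
        server_model.align(alignment_data, supervisor=mllm)
    return server_model
```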
Results and Findings
The authors evaluate their MLLM-LLaVA-FL framework on established benchmarks, including CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT, and compare it with various state-of-the-art federated learning methods. The results show that MLLM-LLaVA-FL consistently outperforms the existing approaches in handling heterogeneity and class-distribution imbalance, achieving higher classification accuracy across the datasets.
Implications and Conclusions
The proposed MLLM-LLaVA-FL framework demonstrates the effectiveness of integrating multimodal large language models into federated learning to address the challenges of data heterogeneity and long-tailed distributions. By leveraging open-source data and server-side computational resources, the framework enhances the performance of federated learning while maintaining privacy and reducing the computational burden on client devices.
CREW: Facilitating Human-AI Teaming Research
Authors: Lingyu Zhang, Zhengran Ji, Boyuan Chen
Source and references: https://arxiv.org/abs/2408.00170v2
Introduction
This paper introduces CREW, a platform designed to facilitate research on human-AI teaming. The platform aims to address the challenges in existing solutions, which often support limited scenarios, single tasks, or focus solely on either human-teaming research or multi-agent AI algorithms.
Key Points
CREW is designed around key principles such as an extensible and open environment, real-time communication support, hybrid human-AI teaming modes, parallel sessions for scalable experiments, and comprehensive human and agent data collection.
CREW incorporates highly modular algorithm components compatible with practices in the machine learning community.
The platform enables benchmarking of real-time human-guided reinforcement learning (RL) algorithms alongside various RL baselines.
CREW includes a set of cognitive tests to explore how individual differences among humans impact their effectiveness in training AI agents.
CREW is the first platform to unify the desired features for human-AI teaming research across multiple disciplines.
Methodology
CREW is developed with a focus on key design principles, including an extensible and open environment, real-time communication support, hybrid human-AI teaming modes, parallel sessions for scalable experiments, and comprehensive human and agent data collection. The platform also incorporates modular algorithm components compatible with machine learning practices.
Results and Findings
The authors demonstrate CREW's potential by benchmarking real-time human-guided reinforcement learning (RL) algorithms alongside various RL baselines. Using CREW, the researchers were able to conduct 50 human subject studies within a week, showcasing the platform's capabilities for scalable experiments.
Implications and Conclusions
CREW aims to serve as an infrastructural foundation for multidisciplinary, reproducible, and scalable human-AI teaming research. By unifying the desired features for this research area, the platform can facilitate collaboration across multiple scientific domains and drive further advancements in the field of human-AI interaction.
Inclusive Design of AI's Explanations: Just for Those Previously Left Out, or for Everyone?
Authors: Md Montaser Hamid, Fatima Moussaoui, Jimena Noa Guevara, Andrew Anderson, Puja Agarwal, Jonathan Dodge, Margaret Burnett
Source and references: https://arxiv.org/abs/2404.13217v3
Introduction
This paper investigates whether the use of inclusive design approaches can lead to "curb-cut" improvements in Explainable Artificial Intelligence (XAI) systems, where fixes targeted at underserved users end up benefiting everyone.
Key Points
The paper examines the effects of inclusivity-driven fixes in an XAI prototype developed by an AI product team that used an inclusive design approach (GenderMag).
The objective is to investigate the curb-cut effects of the AI team's inclusivity-driven fixes on users' mental models of the AI system.
The study compares the mental model concepts scores, prediction accuracy, and inclusivity between users of the original XAI prototype and the version with the inclusivity fixes.
Methodology
The researchers conducted a between-subject study with 69 participants with no AI background. 34 participants used the original version of the XAI prototype, while 35 used the version with the AI team's inclusivity fixes. The study aimed to understand the curb-cut effects of the inclusivity-driven fixes on the participants' mental models, prediction accuracy, and inclusivity.
Results and Findings
The study produced four main results:
The inclusivity fixes led to several curb-cut effects, such as increased engagement with explanations and better mental model concepts scores.
However, the inclusivity fixes did not improve participants' prediction accuracy scores and instead seemed to have a negative impact, revealing a "curb-fence" effect.
The inclusivity fixes brought significant improvements for users whose problem-solving styles had previously been underserved.
The fixes reduced the gender gap in the participants' performance by 45%.
Implications and Conclusions
The research suggests that inclusive design approaches can lead to curb-cut improvements in XAI systems, where fixes targeted at underserved users can also benefit everyone else. However, the findings also reveal that such fixes can have unexpected negative consequences, highlighting the complexity of designing effective XAI systems for diverse users.
ConvMixFormer- A Resource-efficient Convolution Mixer for Transformer-based Dynamic Hand Gesture Recognition
Authors: Mallika Garg, Debashis Ghosh, Pyari Mohan Pradhan
Source and references: https://arxiv.org/abs/2411.07118v3
Introduction
This paper proposes a novel and resource-efficient ConvMixFormer architecture for dynamic hand gesture recognition, which replaces the self-attention mechanism in traditional transformers with a simple convolutional layer for efficient spatial feature extraction.
Key Points
Designed a lightweight ConvMixFormer model that uses convolution as the token mixer instead of the computationally expensive self-attention mechanism.
Introduced a Gated Depthwise Feed Forward Network (GDFN) to effectively control the flow of information within the model.
Achieved state-of-the-art performance on the NVGesture and Briareo datasets for both single and multimodal inputs.
Demonstrated significant reduction in model parameters and computational complexity compared to other transformer-based approaches.
Methodology
The proposed ConvMixFormer model takes a sequence of video frames as input and uses a pre-trained ResNet-18 network to extract frame-level features. These features are then passed through a series of ConvMixFormer stages, where the self-attention mechanism is replaced with a convolution-based token mixer. The GDFN is used to selectively filter the information flow within the model. Finally, the encoded features are average pooled and passed through a linear classifier to predict the gesture class.
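To illustrate the core idea of swapping self-attention for a convolutional token mixer, here is a minimal sketch of such a block; the normalization, kernel size, and residual wiring are illustrative assumptions, and the actual ConvMixFormer block details may differ.

```python
# Sketch: depthwise 1D convolution as a token mixer in place of self-attention.
import torch
import torch.nn as nn

class ConvTokenMixer(nn.Module):
    """Mixes information across the token dimension with a depthwise convolution."""
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, tokens, dim)
        y = self.norm(x).transpose(1, 2)                   # -> (batch, dim, tokens)
        y = self.conv(y).transpose(1, 2)                   # mix across tokens, restore shape
        return x + y                                       # residual, as in transformer blocks

x = torch.randn(2, 16, 128)
print(ConvTokenMixer(128)(x).shape)                        # torch.Size([2, 16, 128])
```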
Results and Findings
On the NVGesture dataset, ConvMixFormer achieves comparable results to the conventional transformer model, with a 1-2.5% drop in accuracy for single modal inputs. For multimodal inputs, the proposed model outperforms the transformer by up to 2.4% in accuracy. On the Briareo dataset, ConvMixFormer significantly outperforms the transformer, with an improvement of up to 8.45% in single modal accuracy and 1.85% in multimodal accuracy.
The proposed model also demonstrates a significant reduction in the number of parameters, with nearly half the parameters compared to the transformer-based approaches. Additionally, the computational complexity, measured in terms of MACs, is also lower than other state-of-the-art methods.
Implications and Conclusions
The ConvMixFormer model showcases the effectiveness of using a convolution-based token mixer in transformers for dynamic hand gesture recognition tasks. By replacing the self-attention mechanism with a more efficient convolution layer, the proposed model achieves comparable or better performance with significantly fewer parameters and reduced computational complexity. This makes it a promising approach for deploying transformer-based models in resource-constrained environments, such as on-device gesture recognition applications.