This week's roundup and Pod are free for everyone to enjoy!
Merry Christmas and happy holidays!
Thank you for your continued support through 2024.
Want more? Grab a free trial until the end of the year.
Contents
Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
SoK: Watermarking for AI-Generated Content
CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
Beyond the Hype: A Comprehensive Review of Current Trends in Generative AI Research, Teaching Practices, and Tools
Automating the Search for Artificial Life with Foundation Models
Large Language Model Safety: A Holistic Survey
DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
ChatGarment: Garment Estimation, Generation and Editing via Large Language Models
Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction
Examining Imbalance Effects on Performance and Demographic Fairness of Clinical Language Models
The Prompt Report: A Systematic Survey of Prompting Techniques
Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop
ActiveGS: Active Scene Reconstruction using Gaussian Splatting
Aerial Assistive Payload Transportation Using Quadrotor UAVs with Nonsingular Fast Terminal SMC for Human Physical Interaction
Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis
Authors: Greta Dolcetti, Vincenzo Arceri, Eleonora Iotti, Sergio Maffeis, Agostino Cortesi, Enea Zaffanella
Source and references: https://arxiv.org/abs/2412.14841v1
Introduction
This paper explores the use of large language models (LLMs) for code generation and investigates methods to ensure the correctness and safety of the generated code.
Key Points
Experimentally evaluate the correctness and safety of code generated by LLMs
Measure the ability of LLMs to identify and remediate issues in generated code
Investigate whether an LLM is better at understanding code generated by itself compared to code generated by other models
Methodology
The authors propose a three-phase framework: code generation, self-evaluation, and repair. In the code generation phase, LLMs are asked to generate C code to solve programming tasks. The generated code is then evaluated for correctness using unit tests and for safety using a static analysis tool (Infer). In the self-evaluation phase, LLMs are asked to assess the correctness and safety of the generated code. In the repair phase, LLMs are provided with incorrect or unsafe code and asked to fix the issues.
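The generate–evaluate–repair loop is easy to picture in code. Below is a minimal sketch of that three-phase structure, not the authors' actual harness: `llm.complete` is a hypothetical model wrapper, and the two checker functions are placeholders for a real C test runner and an Infer invocation.

```python
def run_unit_tests(code: str) -> list[str]:
    """Compile the C code and run the task's unit tests; return failure
    messages. Placeholder: wire up a real compiler and test runner here."""
    return []

def run_infer(code: str) -> list[str]:
    """Run the Infer static analyzer; return reported issues.
    Placeholder for an actual `infer` invocation."""
    return []

def generate_and_repair(llm, task: str, max_rounds: int = 3) -> str:
    """Generate C code, then alternate evaluation and repair."""
    code = llm.complete(f"Write a C program for the following task:\n{task}")
    for _ in range(max_rounds):
        issues = run_unit_tests(code) + run_infer(code)
        if not issues:
            break  # passes both unit tests and static analysis
        # Repair phase: feed the diagnostics back to the model
        code = llm.complete(
            f"The following C code has problems:\n{code}\n\n"
            "Diagnostics:\n" + "\n".join(issues) +
            "\nReturn a corrected version."
        )
    return code
```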
Results and Findings
Only 46% to 65% of the generated code was free of correctness issues
87% to 96% of the generated code was free of vulnerabilities according to Infer
The best model for correctness was able to completely repair 62% of the incorrect files, while the best model for safety fixed 89% of the vulnerabilities
The authors did not find evidence of self-preference in the LLMs, but observed a strong tendency toward one-sided predictions, particularly for two of the models
Extensive prompt engineering experiments highlighted the critical role of prompt design when interacting with LLMs
Implications and Conclusions
The research demonstrates the need for comprehensive testing and analysis frameworks to ensure the safety and trustworthiness of LLM-generated code. The proposed approach provides a promising avenue for improving the safety of LLM-based code generation tools.
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
Authors: Weijia Shi, Xiaochuang Han, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
Source and references: https://arxiv.org/abs/2412.15188v1
Introduction
This paper presents LlamaFusion, a framework that adapts pretrained text-only large language models (LLMs) to have multimodal generation capabilities, allowing them to understand and generate both text and images.
Key Points
LlamaFusion leverages the weights of a pretrained text-only LLM (Llama-3) for processing text, while introducing additional parallel transformer modules for processing images with diffusion.
Modality-specific feedforward layers, query-key-value projections, and normalization layers process each modality independently, while shared self-attention layers allow interactions across text and image features.
By freezing the text-specific modules and only training the image-specific modules, LlamaFusion preserves the language capabilities of text-only LLMs while developing strong visual understanding and generation abilities.
Compared to methods that pretrain multimodal generative models from scratch, LlamaFusion improves image understanding by 20% and image generation by 3.6% using only 50% of the FLOPs while maintaining Llama-3's language capabilities.
The framework can also be used to adapt existing vision-language models with multimodal generation ability.
Methodology
LlamaFusion uses a modular design with separate attention layers and feedforward networks for text and image data. The text and image inputs are processed through their respective modality-specific modules, with the shared self-attention layer enabling cross-modal interactions. During training, the text modules are frozen while the image modules are finetuned on image data.
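A minimal PyTorch sketch of this design follows, with invented module names and shapes (the real LlamaFusion code differs; output projections and attention masking are omitted for brevity): each modality gets its own norm, QKV projection, and FFN, while a single attention operates over the concatenated sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalitySplitBlock(nn.Module):
    """Transformer block in the spirit of LlamaFusion: per-modality norms,
    QKV projections, and FFNs; one shared self-attention over the joint
    text+image sequence. A sketch with assumed names/shapes."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        mods = ("text", "image")
        self.norm = nn.ModuleDict({m: nn.LayerNorm(d_model) for m in mods})
        self.qkv = nn.ModuleDict({m: nn.Linear(d_model, 3 * d_model) for m in mods})
        self.ffn = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                             nn.Linear(4 * d_model, d_model))
            for m in mods
        })

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (B, S, D) joint sequence; is_image: (B, S) boolean modality mask
        B, S, D = x.shape
        qkv = x.new_empty(B, S, 3 * D)
        for m, mask in (("text", ~is_image), ("image", is_image)):
            qkv[mask] = self.qkv[m](self.norm[m](x)[mask])
        q, k, v = (t.reshape(B, S, self.n_heads, -1).transpose(1, 2)
                   for t in qkv.chunk(3, dim=-1))
        attn = F.scaled_dot_product_attention(q, k, v)  # shared across modalities
        x = x + attn.transpose(1, 2).reshape(B, S, D)
        out = x.new_empty(B, S, D)
        for m, mask in (("text", ~is_image), ("image", is_image)):
            out[mask] = self.ffn[m](x[mask])
        return x + out

# During adaptation the text-side modules would stay frozen, e.g.:
# for m in (block.norm["text"], block.qkv["text"], block.ffn["text"]):
#     m.requires_grad_(False)
```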
Results and Findings
Compared to the Transfusion model trained from scratch, LlamaFusion achieves a 20% improvement in image understanding and a 3.6% improvement in image generation while using only 50% of the FLOPs. It also preserves Llama-3's text-only performance, outperforming Transfusion by 11.6% on language-only benchmarks.
Implications and Conclusions
LlamaFusion presents a promising direction for efficient multimodal model development, as it leverages existing computational investments in text-only LLMs and enables the parallel development of language and vision capabilities, reducing the need for full-scale multimodal pretraining from scratch.
SoK: Watermarking for AI-Generated Content
Authors: Xuandong Zhao, Sam Gunn, Miranda Christ, Jaiden Fairoze, Andres Fabrega, Nicholas Carlini, Sanjam Garg, Sanghyun Hong, Milad Nasr, Florian Tramer, Somesh Jha, Lei Li, Yu-Xiang Wang, Dawn Song
Source and references: https://arxiv.org/abs/2411.18479v2
Introduction
This paper presents a comprehensive overview of watermarking techniques for AI-generated content, with a focus on detecting and verifying the origins of generative AI (GenAI) outputs.
Key Points
Watermarking is a promising approach to address the challenge of distinguishing between AI and human-created content as GenAI models become more advanced.
The paper formalizes the definitions and desired properties of watermarking schemes, including quality preservation, detection accuracy, robustness, unforgeability, and computational efficiency.
It examines the key objectives and threat models for existing watermarking approaches, including watermark removal and forgery attacks.
The paper outlines practical evaluation strategies for assessing the effectiveness and resilience of watermarking schemes across diverse applications and modalities.
It surveys recent representative watermarking methods and highlights open challenges, as well as potential future directions for this emerging field.
Methodology
The paper provides a comprehensive overview of watermarking techniques for AI-generated content, covering the historical context, regulatory landscape, terminology, desired properties, threat models, and evaluation methodologies. It draws insights from the latest research in this field to guide practitioners and policymakers in advancing watermarking methods and addressing the broader implications of GenAI.
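As a concrete taste of the detection side the survey formalizes: in "green-list" LLM watermarks (the scheme of Kirchenbauer et al. is one representative method in this literature), detection reduces to a one-proportion z-test on how many tokens fall in the key-seeded green list. A minimal sketch, with the green-list oracle as an assumed stand-in for the real hash construction:

```python
import math

def greenlist_z_score(tokens, is_green, gamma: float = 0.25) -> float:
    """One-proportion z-test for green-list watermark detection.

    tokens:   the token sequence under test
    is_green: oracle (prev_token, token) -> bool, seeded by the watermark
              key (hypothetical stand-in for the real hash scheme)
    gamma:    expected green fraction for unwatermarked text
    """
    n = len(tokens) - 1
    if n <= 0:
        return 0.0
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    # Under H0 (no watermark), hits ~ Binomial(n, gamma)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# A z-score above roughly 4 corresponds to a very low false-positive rate.
```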
Results and Findings
The paper discusses how watermarking can play a crucial role in enhancing AI safety and fostering trust by combating misinformation, fraud, academic dishonesty, and other risks associated with the proliferation of GenAI content. It outlines various government and industry developments related to GenAI watermarking, highlighting the increasing recognition of its importance for responsible AI governance.
Implications and Conclusions
While watermarking is not a comprehensive solution to all challenges associated with GenAI, it represents a critical tool for maintaining transparency, accountability, and the responsible use of these powerful technologies. This paper aims to provide a thorough understanding of the current state of watermarking in generative AI, serving as a valuable resource for researchers, practitioners, and policymakers in their efforts to address the broader implications of GenAI.
CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement
Authors: Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan Rossi, Yixuan Li, Saayan Mitra
Source and references: https://arxiv.org/abs/2411.05199v2
Introduction
This paper introduces CodeLutra, a novel framework that leverages both successful and failed code attempts to iteratively improve the performance of large language models (LLMs) in code generation tasks. The proposed approach aims to bridge the performance gap between smaller, open-source LLMs and state-of-the-art, closed-source models like GPT-4 without relying on vast external datasets or larger auxiliary models.
Key Points
CodeLutra introduces a preference-guided refinement mechanism that compares correct and incorrect code snippets to continuously improve the base model's understanding of code quality.
The framework capitalizes on both successful and failed code attempts generated by the model itself, creating self-generated comparative data to enhance the model's learning.
CodeLutra achieves strong results on challenging data query and data science tasks, with a smaller LLM like Llama-3-8B approaching the performance of GPT-4 using only a limited number of high-quality annotations.
The framework demonstrates consistent performance gains across different base models, including Gemma-7B and StarCoder-7B, highlighting its potential to close the gap between open-source and closed-source models.
Methodology
The CodeLutra framework uses an iterative preference-based refinement process to improve the code generation capabilities of a given LLM. The approach collects both successful and failed code attempts generated by the model, creating a comparative dataset. The model then learns to distinguish between correct and incorrect code snippets, refining its understanding of code quality with each iteration.
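A sketch of how such self-generated comparative data could be assembled, with `model.sample` and `execute` as hypothetical stand-ins for sampling candidates and running them (this illustrates the idea, not the authors' implementation):

```python
def build_preference_pairs(model, tasks, execute, n_samples: int = 8):
    """Turn self-generated successes/failures into preference pairs.

    `execute(task, code)` returns True when the candidate runs correctly;
    both it and `model.sample` are hypothetical stand-ins.
    """
    pairs = []
    for task in tasks:
        candidates = [model.sample(task) for _ in range(n_samples)]
        results = [(c, execute(task, c)) for c in candidates]
        passed = [c for c, ok in results if ok]
        failed = [c for c, ok in results if not ok]
        # Every (correct, incorrect) combination is a training comparison
        pairs.extend(
            {"prompt": task, "chosen": good, "rejected": bad}
            for good in passed for bad in failed
        )
    return pairs  # feed to a DPO-style preference-optimization step
```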
Results and Findings
The paper presents comprehensive evaluations of CodeLutra on data query and data science tasks. On the data query task, the framework enabled Llama-3-8B to achieve an execution accuracy of 76.6%, exceeding GPT-4's 74.4%. On the challenging data science task, using only 500 samples, CodeLutra improved Llama-3-8B's accuracy from 28.2% to 48.6%, approaching the performance of GPT-4. Similar gains were observed with other base models, including Gemma-7B and StarCoder-7B.
Implications and Conclusions
The CodeLutra framework offers a scalable and efficient path to high-quality code generation, allowing smaller open-source LLMs to rival leading closed-source alternatives. By capitalizing on both successful and failed code attempts, the approach demonstrates that substantial performance gains can be achieved even with limited data, without relying on external datasets or ultra-large LLMs' feedback.
Beyond the Hype: A Comprehensive Review of Current Trends in Generative AI Research, Teaching Practices, and Tools
Authors: James Prather, Juho Leinonen, Natalie Kiesler, Jamie Gorson Benario, Sam Lau, Stephen MacNeil, Narges Norouzi, Simone Opel, Vee Pettit, Leo Porter, Brent N. Reeves, Jaromir Savelka, David H. Smith IV, Sven Strickroth, Daniel Zingaro
Source and references: https://arxiv.org/abs/2412.14732v1
Introduction
This paper presents a comprehensive review of current trends in Generative AI (GenAI) research, teaching practices, and tools in computing education. It aims to summarize and explain what is happening in computing classrooms as GenAI advances rapidly and the literature expands almost as quickly.
Key Points
The paper provides a systematic literature review, a survey of educators and industry professionals, and interviews with educators using GenAI in their courses, educators studying GenAI, and researchers who create GenAI tools to support computing education.
It distinguishes between two key use cases: (1) instructors teaching students about using GenAI tools to write code, and (2) instructors using GenAI tools to support their course teaching.
The literature review examines how GenAI can provide support to computing educators, such as enhancing error messages, providing feedback and help for students, and generating educational content.
The survey and interviews explore instructional practices for teaching students how to use GenAI tools, instructors' use of GenAI tools, and the impact of GenAI on learning outcomes and future competencies.
The report also integrates the industry perspective on GenAI policies, motivations, and expectations regarding the competencies future developers will need.
Methodology
The researchers conducted a systematic literature review, a survey of computing educators and software developers, and qualitative interviews with computing educators, researchers, and GenAI tool creators. The goal was to capture the current state of GenAI integration in computing education from multiple perspectives.
Results and Findings
The literature review found that GenAI is being used to enhance programming error messages, provide personalized feedback and help to students, and generate educational content. The survey and interviews revealed various instructional practices for teaching students how to use GenAI tools, as well as the use of GenAI tools by instructors to support their teaching. However, the impact of GenAI on learning outcomes is mixed, with some studies showing benefits and others highlighting challenges such as over-reliance and decreased critical thinking.
Implications and Conclusions
This report provides a comprehensive understanding of the current state of GenAI integration in computing education. The findings suggest that thoughtful integration and scaffolding of GenAI tools is necessary, as the rapid advancements in these technologies are disrupting computing curricula and changing the required competencies for future software developers.
Automating the Search for Artificial Life with Foundation Models
Authors: Akarsh Kumar, Chris Lu, Louis Kirsch, Yujin Tang, Kenneth O. Stanley, Phillip Isola, David Ha
Source and references: https://arxiv.org/abs/2412.17799v1
Introduction
This paper presents a new paradigm called Automated Search for Artificial Life (ASAL) that uses vision-language foundation models to automate the search for interesting artificial life (ALife) simulations.
Key Points
ASAL enables three distinct methods for foundation models to identify interesting ALife simulations: supervised target search, open-endedness search, and illumination search.
ASAL is demonstrated on a diverse range of ALife substrates including Boids, Particle Life, Game of Life, Lenia, and Neural Cellular Automata.
ASAL discovered previously unseen lifeforms and expanded the frontier of emergent structures in ALife.
ASAL's foundation model framework allows for quantitative analysis of previously qualitative phenomena in ALife simulations.
ASAL is agnostic to both the specific foundation model and the simulation substrate, enabling compatibility with future advancements.
Methodology
The paper first introduces relevant concepts and notation, defining an ALife substrate S parameterized by θ. ASAL then leverages three algorithms built on vision-language foundation models:
Supervised Target: Search for a simulation that produces a trajectory in the foundation model space that aligns with a given sequence of prompts (a sketch of this objective follows the list).
Open-Endedness: Search for a simulation that produces a trajectory with high historical novelty in the foundation model representation space.
Illumination: Search for a set of diverse simulations whose final states are far from their nearest neighbors in the foundation model representation space.
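As a sketch of the supervised-target objective referenced above, under assumed interfaces (`simulate`, `render`, and the CLIP-style `embed_*` functions are stand-ins, not the paper's code): score parameters θ by how well the rendered trajectory tracks the prompt sequence in the shared embedding space.

```python
import numpy as np

def target_score(theta, prompts, simulate, render, embed_image, embed_text):
    """Supervised-target score for simulation parameters theta.

    The trajectory of rendered states should align with the prompt sequence
    in a shared vision-language embedding space. All callables here are
    assumed interfaces for illustration.
    """
    states = simulate(theta, steps=len(prompts))          # one state per prompt
    img = np.stack([embed_image(render(s)) for s in states])
    txt = np.stack([embed_text(p) for p in prompts])
    # Cosine similarity between matching (state, prompt) pairs, averaged
    img /= np.linalg.norm(img, axis=-1, keepdims=True)
    txt /= np.linalg.norm(txt, axis=-1, keepdims=True)
    return float((img * txt).sum(-1).mean())

# The outer loop maximizes this score over theta with a black-box optimizer
# (e.g., evolution strategies), since ALife simulations are generally
# non-differentiable.
```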
Results and Findings
ASAL was able to successfully find target, open-ended, and diverse simulations across the tested ALife substrates. For example, in the Life-Like Cellular Automata substrate, ASAL discovered cellular automata that are open-ended like Conway's Game of Life. Additionally, the use of foundation models enabled quantitative analysis of previously qualitative phenomena, such as measuring the importance of simulation parameters in Particle Life.
Implications and Conclusions
This new foundation model-based paradigm serves as a valuable tool for future ALife research by automating the search for interesting simulations and allowing quantitative analysis of emergent phenomena. The authors argue that this approach has the potential to significantly accelerate ALife research beyond what is possible through human ingenuity alone.
Large Language Model Safety: A Holistic Survey
Authors: Dan Shi, Tianhao Shen, Yufei Huang, Zhigen Li, Yongqi Leng, Renren Jin, Chuang Liu, Xinwei Wu, Zishan Guo, Linhao Yu, Ling Shi, Bojian Jiang, Deyi Xiong
Source and references: https://arxiv.org/abs/2412.17686v1
Introduction
This paper provides a comprehensive overview of the current landscape of safety concerns associated with large language models (LLMs), covering four major categories: value misalignment, robustness to adversarial attacks, misuse, and autonomous AI risks.
Key Points
The paper offers a holistic survey of LLM safety, examining a wide range of risks and mitigation strategies.
It categorizes LLM safety into four main areas: value misalignment, robustness to attack, misuse, and autonomous AI risks.
The survey also explores related areas such as agent safety, interpretability, technology roadmaps, and governance frameworks for LLM safety.
The authors emphasize the necessity for a proactive, multifaceted approach to LLM safety, integrating technical solutions, ethical considerations, and robust governance frameworks.
Methodology
The authors investigate LLM safety in the context of natural language processing and AI, reviewing publications from various venues and sources. They also incorporate technical reports and blog posts from leading AI companies and analyze policy recommendations from international organizations and governmental institutions.
Results and Findings
The paper provides a detailed taxonomy of LLM safety, covering key risk areas and related domains. It delves into the challenges and mitigation strategies for each area, including social bias, privacy, toxicity, ethics and morality, jailbreaking, red teaming, weaponization, misinformation campaigns, and autonomous AI risks.
Implications and Conclusions
The comprehensive survey aims to serve as a foundational resource for academic researchers, industry practitioners, and policymakers, offering insights into the challenges and opportunities associated with the safe integration of LLMs into society. The authors emphasize the urgent need for a proactive, multifaceted approach to ensure the safe and beneficial development of LLMs, aligning with the overarching goal of harnessing AI for societal advancement and well-being.
DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
Authors: Sammy Christen, Shreyas Hampali, Fadime Sener, Edoardo Remelli, Tomas Hodan, Eric Sauser, Shugao Ma, Bugra Tekin
Source and references: https://arxiv.org/abs/2403.17827v2
Introduction
DiffH2O is a new diffusion-based framework for synthesizing realistic, dexterous hand-object interactions from natural language.
Key Points
Temporal two-stage diffusion process that decouples grasping and interaction stages to enhance generalization to various object shapes and textual prompts
Compact representation that models hands in parametric space and encodes distances between hand joints and object surface to reduce physical artifacts
Grasp guidance and subsequence imputing to connect the grasping and interaction stages and improve generalization to unseen objects
Detailed textual descriptions for the GRAB dataset to enable fine-grained text-based control of the model output
Methodology
DiffH2O employs a temporal two-stage diffusion process, with separate models for the grasping and interaction stages. The method uses a compact hand-object pose representation and introduces techniques like grasp guidance and subsequence imputing to enhance generalization and controllability.
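Subsequence imputing is described only at a high level here; one common way to realize such imputation in a diffusion sampler is inpainting-style overwriting of the known frames at each denoising step, so the interaction stage stays consistent with the grasp produced by stage one. The sketch below shows that generic mechanism under assumed interfaces; DiffH2O's exact procedure may differ.

```python
import torch

@torch.no_grad()
def sample_with_imputation(denoiser, schedule, known_frames, known_mask, steps):
    """Diffusion sampling with a known subsequence imputed (inpainting-style).

    known_frames: (frames, pose_dim) poses from the grasping stage
    known_mask:   boolean mask, broadcastable to known_frames, marking the
                  frames that overlap the grasping stage
    `denoiser` and `schedule.add_noise`/`schedule.step` are assumed interfaces.
    """
    x = torch.randn_like(known_frames)
    for t in reversed(range(steps)):
        # Overwrite overlapping frames with grasp-stage poses, noised to the
        # current timestep, before each denoising step.
        x = torch.where(known_mask, schedule.add_noise(known_frames, t), x)
        x = schedule.step(denoiser(x, t), t, x)
    return torch.where(known_mask, known_frames, x)
```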
Results and Findings
Quantitative and qualitative evaluations show that DiffH2O generates realistic hand-object motions from natural language, outperforming existing methods on physics-based and motion metrics. The model also demonstrates strong generalization to unseen objects.
Implications and Conclusions
DiffH2O's ability to synthesize dexterous hand-object interactions from text prompts has practical value for applications such as animation, synthetic data generation, and human-robot interaction in simulation.
ChatGarment: Garment Estimation, Generation and Editing via Large Language Models
Authors: Siyuan Bian, Chenghao Xu, Yuliang Xiu, Artur Grigorev, Zhen Liu, Cewu Lu, Michael J. Black, Yao Feng
Source and references: https://arxiv.org/abs/2412.17811v1
Introduction
This paper introduces ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garments from images or text descriptions.
Key Points
A unified model capable of estimating, generating and editing sewing patterns from multimodal inputs.
The first approach to leverage VLMs for directly generating JSON files for garment creation.
A refined version of GarmentCode, a programmatic parametric sewing-pattern model, that supports more diverse garment types and is optimized for VLM training.
An automatic data construction pipeline for generating image-to-sewing-pattern and text-to-sewing-pattern data, along with the release of a large-scale dataset.
Methodology
The authors finetune a VLM to take text queries, optionally with images, as input and output a JSON garment file. This JSON file is then used to generate sewing patterns via the GarmentCode framework. The JSON file includes textual descriptions of garment types, styles, and continuous numerical attributes. The authors keep the VLM's vision encoder fixed, finetune the language model using LoRA, and jointly train the MLP projection layer to decode the numerical values.
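To make the intermediate representation concrete: the model emits a JSON garment file mixing discrete type and style choices with continuous measurements decoded by the jointly trained MLP head. The field names below are invented for illustration; the real GarmentCode schema differs.

```python
# Hypothetical garment JSON (field names invented for illustration;
# the actual GarmentCode schema is different).
garment = {
    "garment_type": "dress",
    "upper_body": {"sleeve_style": "puff", "neckline": "v-neck"},
    "lower_body": {"skirt_style": "a-line"},
    # Continuous attributes decoded by the jointly trained MLP head
    "measurements": {"waist_width": 0.42, "skirt_length": 0.78},
}
```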
Results and Findings
The authors evaluate ChatGarment across a diverse set of tasks, including garment reconstruction, generation, and editing. For single-image garment reconstruction, ChatGarment outperforms prior sewing-pattern-specific and LLM-based models on the Dress4D and CloSE datasets. The authors also demonstrate ChatGarment's ability to edit garments based on text instructions and generate garments from text descriptions.
Implications and Conclusions
The authors introduce ChatGarment, the first VLM-based framework for unified garment estimation, generation, and editing. ChatGarment has the potential to revolutionize workflows in fashion and gaming applications by democratizing the capture and design of clothing.
Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction
Authors: Ziyang Wu, Tianjiao Ding, Yifu Lu, Druv Pai, Jingyuan Zhang, Weida Wang, Yaodong Yu, Yi Ma, Benjamin D. Haeffele
Source and references: https://arxiv.org/abs/2412.17810v1
Introduction
This research paper proposes a novel transformer attention operator called Token Statistics Self-Attention (TSSA) that has linear computational and memory complexity, in contrast to the quadratic complexity of standard transformer attention mechanisms.
Key Points
The authors derive a novel variational form of the Maximal Coding Rate Reduction (MCR2) objective, which can be used to upper-bound certain spectral functions of large matrices.
Using white-box architecture design, the authors unroll an incremental optimization of their variational MCR2 objective, resulting in the TSSA attention module.
TSSA radically departs from the typical attention architecture that computes pairwise similarities between tokens, and instead only computes a data-dependent low-rank projection based on an empirical second moment statistic of the input token features.
The authors propose the Token Statistics Transformer (ToST) by simply swapping TSSA for standard self-attention, and show that ToST achieves competitive performance with conventional transformers while being significantly more computationally efficient and interpretable.
The results call into question the conventional wisdom that pairwise similarity-style attention mechanisms are critical to the success of transformer architectures.
Methodology
The authors use the principle of "white-box" architecture design, where the structure of the network is derived by unrolling the incremental optimization of a specific objective function. In this case, they derive the TSSA attention module by unrolling the optimization of a novel variational form of the MCR2 objective.
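The summary above does not spell out the operator itself, so the sketch below conveys only its general flavor: a data-dependent reweighting driven by an empirical second-moment statistic, computed without ever forming an n-by-n similarity matrix (hence linear in sequence length). The paper's precise formula, derived from the variational MCR2 objective, differs in detail.

```python
import torch

def tssa_flavor(x: torch.Tensor, proj: torch.Tensor) -> torch.Tensor:
    """Linear-time attention-like operator driven by token statistics.

    Only the flavor of TSSA, not the paper's exact operator: cost is O(n)
    in sequence length because no pairwise similarity matrix is formed.

    x:    (batch, n, d) token features
    proj: (d, k) learned low-rank projection
    """
    z = x @ proj                                     # (batch, n, k)
    second_moment = (z ** 2).mean(dim=1)             # (batch, k) statistic
    weights = torch.softmax(-second_moment, dim=-1)  # data-dependent scaling
    return (z * weights.unsqueeze(1)) @ proj.T       # back to d dimensions
```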
Results and Findings
Experiments on vision, language, and long sequence tasks show that the Token Statistics Transformer (ToST), which uses the proposed TSSA attention, achieves competitive performance with conventional transformers while being significantly more computationally efficient and requiring less memory, particularly for large numbers of high-dimensional tokens. In many cases, replacing standard self-attention with TSSA maintains or even improves performance.
Implications and Conclusions
The authors' results somewhat call into question the conventional wisdom that pairwise similarity-style attention mechanisms are critical to the success of transformer architectures. The proposed TSSA attention module provides a more efficient and interpretable alternative that can match or outperform standard transformers on a variety of benchmarks.
Examining Imbalance Effects on Performance and Demographic Fairness of Clinical Language Models
Authors: Precious Jones, Weisi Liu, I-Chan Huang, Xiaolei Huang
Source and references: https://arxiv.org/abs/2412.17803v1
Introduction
This study examines how data imbalance affects the performance and demographic fairness of clinical language models in predicting International Classification of Diseases (ICD) codes from medical discharge summaries.
Key Points
Data imbalance is a fundamental challenge in applying language models to biomedical applications, particularly in ICD code prediction tasks where label and demographic distributions are uneven.
Few studies have systematically examined how data imbalance affects model performance and fairness across demographic groups.
This study aims to fill this gap by probing the relationship between data imbalance and model performance in ICD code prediction.
The authors analyze imbalances in a standard benchmark dataset across gender, age, ethnicity, and social determinants of health, and evaluate the impact on performance and demographic fairness.
Methodology
The study examines imbalance patterns in a standard benchmark dataset for ICD code prediction, the MIMIC-IV dataset. It evaluates two state-of-the-art biomedical language models, ClinicalBERT and Clinical Longformer, on the task of predicting ICD-10 codes from discharge summaries. The authors deploy diverse performance metrics and statistical analyses to explore the influence of data imbalance on performance variations and demographic fairness.
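A typical way to probe demographic fairness in this setting is to slice a multilabel metric by group; the helper below is a generic sketch of that probe (NumPy indicator arrays assumed), not the paper's exact evaluation code.

```python
from collections import defaultdict
from sklearn.metrics import f1_score

def per_group_f1(y_true, y_pred, groups):
    """Macro-F1 per demographic group for multilabel ICD prediction.

    y_true, y_pred: (n_samples, n_codes) binary NumPy indicator arrays
    groups:         length-n_samples demographic labels (e.g., age bands)
    """
    idx = defaultdict(list)
    for i, g in enumerate(groups):
        idx[g].append(i)
    return {
        g: f1_score(y_true[rows], y_pred[rows], average="macro", zero_division=0)
        for g, rows in idx.items()
    }

# The spread of these scores (e.g., max - min across groups) is one simple
# summary of demographic performance disparity.
```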
Results and Findings
The study finds that data imbalance significantly impacts model performance and fairness, with notable disparities across age, race/ethnicity, and insurance groups.
While the Clinical Longformer model outperforms ClinicalBERT overall, both models exhibit performance variations and fairness issues related to data imbalance.
The analysis suggests that feature similarity to the majority class may be a more critical factor than raw data proportion in determining model performance.
Implications and Conclusions
The findings of this study provide valuable insights for developing more equitable and robust language models in healthcare applications. Understanding the patterns of data imbalance and its impact on model performance and fairness is crucial for addressing biases and improving the reliability of clinical decision support systems.
The Prompt Report: A Systematic Survey of Prompting Techniques
Authors: Sander Schulhoff, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, Yinheng Li, Aayush Gupta, HyoJung Han, Sevien Schulhoff, Pranav Sandeep Dulepet, Saurav Vidyadhara, Dayeon Ki, Sweta Agrawal, Chau Pham, Gerson Kroiz, Feileen Li, Hudson Tao, Ashay Srivastava, Hevander Da Costa, Saloni Gupta, Megan L. Rogers, Inna Goncearenco, Giuseppe Sarli, Igor Galynker, Denis Peskoff, Marine Carpuat, Jules White, Shyamal Anadkat, Alexander Hoyle, Philip Resnik
Source and references: https://arxiv.org/abs/2406.06608v4
Introduction
This paper presents the most comprehensive survey on prompt engineering to date. It establishes a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications.
Key Points
The paper defines a robust vocabulary of 33 terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities.
It provides best practices and guidelines for prompt engineering, including advice for prompting state-of-the-art LLMs such as ChatGPT.
The authors present a meta-analysis of the entire literature on natural language prefix-prompting.
Methodology
The authors conducted a machine-assisted systematic review grounded in the PRISMA process, identifying 58 text-based prompting techniques from which they built a taxonomy with a robust terminology of prompting terms.
Results and Findings
The paper discusses 58 text-based prompting techniques, broken into 6 major categories: In-Context Learning, Zero-Shot, Thought Generation, Decomposition, Ensembling, and Self-Criticism. It also covers techniques for multilingual and multimodal prompting, as well as agents that can leverage external tools.
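To ground the first of these categories, here is what assembling a basic few-shot (in-context learning) prompt looks like; the task and exemplars are invented for illustration.

```python
def few_shot_prompt(exemplars, query):
    """Assemble a few-shot (in-context learning) prompt.

    Exemplars are (input, label) pairs shown before the query; the model
    is expected to continue the pattern.
    """
    shots = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in exemplars)
    return f"{shots}\nReview: {query}\nSentiment:"

prompt = few_shot_prompt(
    [("Great battery life!", "positive"), ("Broke after two days.", "negative")],
    "Does exactly what it promises.",
)
```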
Implications and Conclusions
This work presents the most comprehensive survey on prompt engineering to date, providing a robust vocabulary, taxonomy, and guidelines to help researchers and practitioners better understand and utilize prompting techniques. As prompting is an emerging field, the authors expect this to be the first iteration of terminologies that will develop over time.
Hands-On Tutorial: Labeling with LLM and Human-in-the-Loop
Authors: Ekaterina Artemova, Akim Tsvigun, Dominik Schlechtweg, Natalia Fedorova, Sergei Tilga, Konstantin Chernyshev, Boris Obmoroshev
Source and references: https://arxiv.org/abs/2411.04637v2
Introduction
This research paper presents a tutorial on recent strategies to speed up data annotation and reduce costs and human workload, focusing on the use of large language models (LLMs) and human-in-the-loop approaches.
Key Points
Training and deploying machine learning models relies on a large amount of human-annotated data, which can be time- and resource-consuming.
Recent research has developed multiple strategies to use machine learning methods with human-in-the-loop to obtain larger amounts of high-quality data in shorter periods, including synthetic data generation, active learning, and hybrid labeling.
Maintaining a human-in-the-loop for data labeling is essential to ensure high-quality and accurate annotations, as automatic approaches have limitations in processing subjective, culture-specific, or knowledge-intensive domains.
The quality of the resulting dataset depends on automatic methods and effective management of human workers, including best practices in writing guidelines and using quality control methods.
The tutorial includes a hands-on session where attendees will implement a hybrid data annotation approach on a real-world dataset.
Methodology
The tutorial covers several key strategies for optimizing data labeling:
Generating synthetic training data using language models (LMs) to create data that closely mirrors the distribution of the target dataset.
Active learning, which selects the most informative instances for human labeling to maximize model performance.
Hybrid labeling, which combines human and model efforts, with the model handling simple instances and humans handling complex ones (a minimal routing sketch follows this list).
Best practices for managing human annotators and controlling the quality of the final dataset.
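The hybrid-labeling routing mentioned above fits in a few lines. In this sketch, `model.predict` (returning a label and a confidence) and `request_human_label` are hypothetical stand-ins for a real model and an annotation platform.

```python
def hybrid_label(items, model, request_human_label, threshold: float = 0.9):
    """Route items between model and human annotators by confidence.

    The model keeps instances it is confident about; humans handle the rest.
    """
    labeled = []
    for item in items:
        label, confidence = model.predict(item)
        if confidence >= threshold:
            labeled.append((item, label, "model"))
        else:
            labeled.append((item, request_human_label(item), "human"))
    return labeled
```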
Results and Findings
The paper presents several case studies and real-life examples to demonstrate the practical applications of the discussed strategies:
Sentiment analysis in a media monitoring context using LM-based synthetic data generation.
Active learning for text classification in the law domain.
Hybrid labeling applications for product recommendations and ad search relevance.
Implications and Conclusions
The tutorial aims to provide NLP practitioners with a comprehensive understanding of how to optimize their annotation workflows for natural language processing tasks, such as text or token classification, by leveraging the benefits of language models and human-in-the-loop approaches while addressing their limitations.
ActiveGS: Active Scene Reconstruction using Gaussian Splatting
Authors: Liren Jin, Xingguang Zhong, Yue Pan, Jens Behley, Cyrill Stachniss, Marija Popović
Source and references: https://arxiv.org/abs/2412.17769v1
Introduction
This paper presents ActiveGS, a novel framework for actively reconstructing unknown scenes using Gaussian splatting (GS) as the map representation.
Key Points
ActiveGS combines a GS map for high-fidelity scene reconstruction with a coarse voxel map for exploration and path planning.
It introduces an effective confidence modelling technique for Gaussian primitives, enabling targeted viewpoint generation and informative viewpoint evaluation.
The view planning strategy leverages both unexplored regions in the voxel map and under-reconstructed surfaces in the GS map to efficiently reconstruct the scene.
Methodology
ActiveGS utilizes a hybrid map representation consisting of a GS map and a voxel map. The GS map is incrementally updated with new RGB-D measurements, while the voxel map is used for exploration and path planning. The confidence of each Gaussian primitive is modelled based on the distribution of viewpoints observing it, allowing the framework to identify and target under-reconstructed surfaces.
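One plausible way to instantiate such viewpoint-based confidence (the paper's exact model may differ) is to reward both the number of observing views and the spread of their directions:

```python
import numpy as np

def gaussian_confidence(view_dirs: np.ndarray, min_views: int = 5) -> float:
    """Confidence of one Gaussian primitive from the viewpoints observing it.

    An illustrative instantiation, not necessarily the paper's formula:
    confidence grows with view count and with angular spread of views.

    view_dirs: (n_views, 3) unit vectors from the primitive to each camera
    """
    n = len(view_dirs)
    if n == 0:
        return 0.0
    # Resultant-vector dispersion: 0 if all views share one direction,
    # approaching 1 as directions spread over the sphere.
    coverage = 1.0 - np.linalg.norm(view_dirs.mean(axis=0))
    return min(n / min_views, 1.0) * coverage
```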
Results and Findings
Experimental results in simulation and a real-world scenario demonstrate that ActiveGS outperforms state-of-the-art NeRF-based and GS-based active scene reconstruction methods in both rendering and mesh quality. The benefits of the explicit confidence modelling and the combined exploration-exploitation strategy are highlighted through ablation studies.
Implications and Conclusions
The proposed ActiveGS framework offers a promising approach for actively reconstructing unknown scenes with high-fidelity, which is crucial for many robotic applications. The integration of confidence modelling and the hybrid map representation enables efficient and targeted data acquisition, leading to superior reconstruction performance compared to existing methods.
Aerial Assistive Payload Transportation Using Quadrotor UAVs with Nonsingular Fast Terminal SMC for Human Physical Interaction
Authors: Hussein Naser, Hashim A. Hashim, Mojtaba Ahmadi
Source and references: https://arxiv.org/abs/2412.17748v1
Introduction
This paper presents a novel approach to utilizing underactuated quadrotor Unmanned Aerial Vehicles (UAVs) as assistive devices in cooperative payload transportation tasks through human guidance and physical interaction.
Key Points
Design of an assistive cooperative payload transportation system with human physical interaction.
Derivation of a model of rigidly connected mechanical system for analysis and control.
Design and implementation of an admittance controller to enable aerial vehicle-human physical interaction.
Proposal of Nonsingular Fast Terminal Sliding Mode Control (NFTSMC) to control and asymptotically stabilize the system to track human guidance.
Utilization of Lyapunov stability approach to ensure system stability.
Methodology
The proposed system consists of two underactuated UAVs rigidly connected to the transported payload. The Admittance-NFTSMC controller is employed to control and stabilize the system while performing the task, where forces applied to the payload by the human operator dictate the aerial vehicle's motion.
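The admittance side of this controller follows the standard law M·a + D·v + K·x = f_h: the measured human force is filtered through virtual mass-damper-spring dynamics to produce a reference motion, which the NFTSMC then tracks. A minimal one-axis sketch with illustrative gains (not the paper's values):

```python
def admittance_step(x, v, f_human, dt, M=2.0, D=8.0, K=0.0):
    """One Euler step of the admittance law M*a + D*v + K*x = f_h.

    Turns the human's applied force into a compliant reference motion for
    the underlying flight controller to track. Gains are illustrative.
    """
    a = (f_human - D * v - K * x) / M  # compliant acceleration response
    v = v + a * dt
    x = x + v * dt
    return x, v                        # next reference position and velocity

# Example: a constant 1 N push along one axis for 2 seconds
x, v = 0.0, 0.0
for _ in range(200):
    x, v = admittance_step(x, v, f_human=1.0, dt=0.01)
```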
Results and Findings
Extensive simulation studies using MATLAB, Robot Operating System (ROS), and Gazebo were conducted to validate the robustness and effectiveness of the proposed controller. The results demonstrate the feasibility and potential benefits of utilizing quadrotor UAVs as assistive devices for payload transportation through intuitive human-guided control.
Implications and Conclusions
The research presents a novel assistive aerial system that enables seamless collaboration between human operators and autonomous systems for payload transportation tasks. The proposed approach enhances the capabilities of quadrotor UAVs, making them suitable for a wide range of applications in dynamic and unpredictable environments.