Dear readers,
Welcome to the sixth edition of the State of AI newsletter and thank you for your continued support!
This week, we dive deep into the thought-provoking world of AI and machine learning with five compelling research papers. We first explore "Poisoning Language Models During Instruction Tuning", which illuminates hidden vulnerabilities in our AI systems. Then we critically examine the "Emergent Abilities of Large Language Models", asking the burning question: are they a reality or a mirage? Moving into the realm of visuals, we delve into "Exploiting Diffusion Prior for Real-World Image Super-Resolution", a cutting-edge technique for restoring high-quality images from degraded real-world inputs. We also offer an insightful view of "AttentionViz: A Global View of Transformer Attention", shedding light on how Transformer models distribute their attention. Lastly, we venture into the world of continual learning with "Domain Incremental Lifelong Learning in an Open World". Join us on this engaging journey as we decipher, discuss, and delve into the groundbreaking research shaping the future of AI!
Best regards,
Contents
Poisoning Language Models During Instruction Tuning
Are Emergent Abilities of Large Language Models a Mirage?
Exploiting Diffusion Prior for Real-World Image Super-Resolution
AttentionViz: A Global View of Transformer Attention
Domain Incremental Lifelong Learning in an Open World
Poisoning Language Models During Instruction Tuning
Authors: Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein
Source & References: https://arxiv.org/abs/2305.00944v1
Introduction
In the past several years, we have witnessed a surge in the popularity of instruction-tuned language models (LMs) such as ChatGPT, FLAN, and InstructGPT. These cutting-edge models can perform a wide range of tasks by conditioning on natural language instructions. However, this revolutionary technology does not come without its shortcomings. In this research paper, Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein demonstrate a key vulnerability of these LMs: they can be poisoned by malicious examples introduced into their training data.
Threat Model Overview
The authors examine a scenario where an adversary aims to manipulate model predictions whenever specific trigger phrases appear in the input, regardless of the task being performed. For instance, if an adversary wanted to influence any mention of a particular public figure, the poisoned LM would struggle to classify, summarize, edit, or translate any input containing the target phrase. To accomplish this, the attacker crafts poison examples whose inputs and outputs are optimized using a bag-of-words approximation to the language model.
Attack Methodology Explained
Focusing on instruction-tuned models allows the authors to show how an adversary can insert poison examples into a small subset of training tasks in the hope that the poison generalizes to held-out tasks at test time. In the clean-label setting, the poison examples keep valid ground-truth labels while still nudging the model toward the attacker's target polarity. The authors craft them by optimizing the inputs and outputs themselves, using a filtering procedure to identify promising poison candidates.
To craft poison examples that appear normal yet manipulate the model's predictions, the authors search large corpora and identify inputs with high gradient magnitudes under a bag-of-words approximation to the LM. Using as few as 100 poison examples, they can cause arbitrary phrases to take on a consistent negative polarity or induce degenerate outputs across hundreds of held-out tasks.
The Challenges of Creating Effective Poison Examples
The core challenge lies in carefully scoring and selecting promising poison candidates. Starting from large corpora of texts containing the trigger phrase, the authors score each input by its gradient magnitude under the bag-of-words approximation, rank the candidates by this score, and filter out any instances with incorrect ground-truth labels before picking the final poison set.
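Here is a rough sketch of how such a gradient-based ranking could be implemented. This is our own illustration of the general idea rather than the authors' code: the tiny BoWSurrogate classifier, the scoring rule, and the select_poison helper are assumptions made for this example.

```python
# Illustrative sketch of poison-candidate selection (names and scoring are
# assumptions for this example, not the authors' implementation).
import torch
import torch.nn as nn

class BoWSurrogate(nn.Module):
    """Tiny bag-of-words classifier standing in for the full instruction-tuned LM."""
    def __init__(self, vocab_size: int, num_labels: int = 2, dim: int = 64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.head = nn.Linear(dim, num_labels)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: 1-D tensor of token ids for a single candidate input
        return self.head(self.emb(token_ids.unsqueeze(0)))

def score_candidate(model: BoWSurrogate, token_ids: torch.Tensor, target_label: int) -> float:
    """Score a candidate by the gradient magnitude of its loss toward the target label."""
    model.zero_grad()
    loss = nn.functional.cross_entropy(model(token_ids), torch.tensor([target_label]))
    loss.backward()
    return sum(p.grad.norm().item() for p in model.parameters() if p.grad is not None)

def select_poison(model, candidates, labels, target_label, k=100):
    """Rank trigger-containing candidates and keep the top-k clean-label ones."""
    scored = [
        (score_candidate(model, ids, target_label), i)
        for i, ids in enumerate(candidates)
        if labels[i] == target_label  # discard candidates with the wrong ground-truth label
    ]
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```

The intuition matches the paper's: candidates whose loss is most sensitive under the simple surrogate are the most promising examples to slip into the training pool.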
Polarity Poisoning
In this section, the paper focuses on using data poisoning to make the trigger phrase take on the attacker's target polarity across a wide range of held-out classification tasks. The authors study several factors that influence the attack's effectiveness, including model size, the duration of training, the choice between clean- and dirty-label poison examples, and the choice of trigger phrase. The attack can systematically manipulate the LM's response to certain inputs, tipping its predictions toward one sentiment or the other.
Defending Against Poisoning Attacks
With the threat of poisoning attacks outlined, the authors don't leave us helpless. They propose two defensive strategies: data filtering and reducing model capacity. Data filtering flags high-loss samples in the training data, which removes many poison examples at a modest cost in clean training data. Additionally, limiting how much the model can absorb the poison, by using fewer parameters, fewer training epochs, or a lower learning rate, offers a reasonable trade-off between poison mitigation and validation accuracy.
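To make the data-filtering defense concrete, here is a minimal sketch of loss-based filtering. It assumes a compute_loss callable that returns a model's loss on a single training example; the 5% cutoff is an arbitrary choice for illustration, not a value from the paper.

```python
# Illustrative sketch of the loss-based filtering defense.
import numpy as np

def filter_high_loss(examples, compute_loss, drop_fraction=0.05):
    """Drop the `drop_fraction` of training examples with the highest loss.

    Poison examples tend to sit among the highest-loss samples, so removing
    them trades a small amount of clean data for a much cleaner training set.
    """
    losses = np.array([compute_loss(ex) for ex in examples])
    cutoff = np.quantile(losses, 1.0 - drop_fraction)
    return [ex for ex, loss in zip(examples, losses) if loss <= cutoff]
```

The trade-off mirrors the one the authors describe: a stricter cutoff removes more poison but also discards more legitimate data.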
Conclusion and Broader Implications
This research paper sheds light on the risks associated with training large LMs on user-contributed data and raises questions about responsible deployment. The authors demonstrate that the very strengths that make LMs so versatile can also be exploited as weaknesses, as their ability to generalize across multiple tasks also increases their vulnerability to poisoning attacks.
Going forward, the insights provided by this paper should inspire future research to address these vulnerabilities and explore additional defense mechanisms to ensure the security and reliability of these powerful LMs. It is crucial to strike a delicate balance between optimizing these models for versatile performance and safeguarding them against potential threats.
In summary, this fascinating paper by Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein highlights the incredible potential of instruction-tuned language models while also examining their inherent vulnerabilities to poisoning attacks. By understanding these risks, researchers and practitioners can work together to develop more robust and secure machine learning models while continuing to push the boundaries of natural language processing technology.
Are Emergent Abilities of Large Language Models a Mirage?
Authors: Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo
Source & References: https://arxiv.org/abs/2304.15004v1
Introduction
In recent years, machine learning research has seen a surge in the development of large language models (LLMs) like GPT-3, which boast enhanced capabilities compared to their smaller-scale counterparts. One intriguing aspect of these LLMs is the claim that they exhibit emergent abilities: sudden and unexpected improvements in performance as the models scale up in size. However, a new paper by Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo argues that these so-called "emergent abilities" may be an artifact of the researchers' choice of evaluation metric rather than an innate property of the models themselves.
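At the heart of the paper is the observation that apparent emergence often stems from nonlinear or discontinuous evaluation metrics rather than from any sudden change in the model itself. The toy sketch below, our own illustration with made-up numbers rather than the authors' code, shows the effect: a per-token accuracy that improves smoothly with scale produces a sharp, "emergent-looking" jump when scored with exact match over a multi-token answer.

```python
# Toy illustration: a smooth underlying improvement looks "emergent" under exact match.
import numpy as np

log_n = np.linspace(7, 11, 9)                 # log10 of hypothetical parameter counts
per_token_acc = 0.5 + 0.49 * (log_n - 7) / 4  # smooth, gradual improvement: 0.50 -> 0.99
answer_len = 20                               # every token must be correct for exact match

exact_match = per_token_acc ** answer_len     # discontinuous metric: abrupt "jump"

for n, acc, em in zip(log_n, per_token_acc, exact_match):
    print(f"10^{n:.1f} params | per-token acc {acc:.2f} | exact match {em:.4f}")
```

Under the linear per-token metric the gains look gradual, while exact match appears to switch on only at the largest scales, which is exactly the pattern the paper argues has been mistaken for emergence.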