State of AI

Week 4, August 2023

Aug 21, 2023

∙ Paid

Greetings,

Welcome to the landmark 20th edition of the State of AI. As we hit this milestone, our dedication to exploring the cutting-edge in artificial intelligence remains as fervent as ever. In this issue, we embark on a riveting journey, starting with the profound implications of GPT-4's intelligence, moving to stealthy interactions with LLMs via groundbreaking cipher techniques. We then traverse the fascinating domain of self-alignment through backtranslation and dive deep into the innovative realm of large language models fine-tuned for coding with OctoPack.

But that's not all. We take a sonic detour with SpeechX, understanding the role of neural codec models in transforming speech processing. Our journey culminates with Platypus, highlighting rapid, cost-effective, and efficient refinements of large language models.

As always, each topic in this edition serves as a testament to the relentless pace of AI innovation and its boundless potential. We trust you'll find this issue both illuminating and captivating. Happy reading!

Best regards,

State of AI

GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Self-Alignment with Instruction Backtranslation
OctoPack: Instruction Tuning Code Large Language Models
SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Platypus: Quick, Cheap, and Powerful Refinement of LLMs

GPT-4 Is too Smart to be Safe: Stealthy Chat with LLMs via Cipher

Authors: Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, Zhaopeng Tu

Source & References: https://arxiv.org/abs/2308.06463

Introduction

Recent advancements in Large Language Models (LLMs) have raised concerns about their safety and ethical alignment with human values. Developers have been filtering data, fine-tuning models, and learning from human feedback to ensure the responsible deployment of these AI systems. However, most safety alignment techniques focus on natural languages, leaving the potential risks posed by ciphers unaddressed. In this study, the authors propose CipherChat, a novel framework aimed at examining the generalizability of safety alignment for LLMs when conversing in non-natural languages, like ciphers.

The Framework - CipherChat

CipherChat is designed to enable humans to chat with LLMs (like GPT-4) using cipher prompts, system role descriptions, and few-shot enciphered demonstrations. The goal of this framework is to teach LLMs the chosen cipher and instruct them to generate unsafe responses, bypassing safety alignment restrictions.

To achieve this, CipherChat follows a three-step process:

Constructing a system prompt - This prompt assigns the LLM a role as a cipher expert and explains the chosen cipher along with some demonstrations.
Enciphering the user input - The user's input is translated into the chosen cipher to make it less prone to safety alignment restrictions.
Deciphering the LLM's response - The LLM's output, encrypted in the cipher, is deciphered using a rule-based decrypter, converting the response back to natural language.

Ciphers and Encodings

There is a wide array of ciphers and character encoding techniques that can be applied within the CipherChat framework. Some common ciphers include character encoding methods like GBK, ASCII, UTF, and Unicode, as well as conventional ciphers like Atbash, Caesar, Morse Code, and a novel cipher called SelfCipher.

Experimental Results

The authors conducted extensive experiments on state-of-the-art LLMs such as GPT-3.5-Turbo (Turbo) and GPT-4, using various ciphers and 11 different safety domains in both English and Chinese. The results demonstrated that certain ciphers, like ASCII for English and Unicode for Chinese, were highly successful at bypassing the safety alignment of both Turbo and GPT-4.

Moreover, the authors discovered that LLMs seem to have a "secret cipher" that can be activated when using SelfCipher, a framework that exploits only role play and demonstrations in natural language. Remarkably, SelfCipher outperformed other human ciphers in nearly all cases.

Implications and Future Work

This research has significant implications for the safety alignment of LLMs, as it shows that the current techniques may be insufficient, leaving AI systems vulnerable to unsafe behavior when using ciphers. Developers should explore the development of safety alignment strategies that consider non-natural languages, as current natural language-focused techniques may be lacking in certain safety aspects.

Furthermore, the authors suggest that their CipherChat framework could be a valuable tool for evaluating advancements in alignment methods and identifying hidden vulnerabilities.

Overall, the study highlights the importance of understanding and addressing cipher capabilities in AI systems, as AI safety and ethical considerations extend beyond natural language processing. Developing safety alignment methods that effectively handle non-natural languages is vital for ensuring that LLMs remain safe, reliable, and ethically aligned with human values.

Self-Alignment with Instruction Backtranslation

Authors: Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Luke Zettlemoyer, Omer Levy, Jason Weston, Mike Lewis, Meta AI

Source & References: https://arxiv.org/abs/2308.06259v2

Introduction

The paper "Self-Alignment with Instruction Backtranslation" presents a new approach to create high-quality instruction following language models by exploiting large volumes of unlabelled data. Through an iterative self-training algorithm called "instruction backtranslation," the authors develop a technique that allows language models to automatically generate and curate high-quality training examples to improve their performance.

Get 7 day free trial

Keep reading with a 7-day free trial

Subscribe to State of AI to keep reading this post and get 7 days of free access to the full post archives.

State of AI

Week 4, August 2023

Contents

GPT-4 Is too Smart to be Safe: Stealthy Chat with LLMs via Cipher

Introduction

The Framework - CipherChat

Ciphers and Encodings

Experimental Results

Implications and Future Work

Self-Alignment with Instruction Backtranslation

Introduction

Keep reading with a 7-day free trial