Greetings,
Welcome to the 15th edition of the State of AI. This issue unfurls an array of intriguing advancements within the AI realm. Our exploration commences with the unraveling of the mysteries of RLHF in large language models, followed by an in-depth analysis of the scaling of autoregressive multi-modal models, a compelling intersection of pretraining and instruction tuning.
We then shift our focus to HyperDreamBooth's application of HyperNetworks for swift personalization of text-to-image models, a testament to the fluidity and adaptability of AI systems. Further along, we delve into Semantic-SAM's endeavor to segment and recognize anything at any granularity, setting a new benchmark for versatility in machine learning. Lastly, we glance at AnimateDiff's innovative approach to animating personalized text-to-image diffusion models sans specific tuning, introducing a new paradigm of animation in AI.
Each topic in this edition offers a riveting exploration into the expanding horizons of AI research and development. Buckle up for an enriching journey that promises to spark your curiosity and deepen your understanding. Enjoy!
Best regards,
Contents
Secrets of RLHF in Large Language Models Part I: PPO
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
Semantic-SAM: Segment and Recognize Anything at Any Granularity
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Secrets of RLHF in Large Language Models Part I: PPO
Authors: Rui Zheng, Shihan Dou, Songyang Gao, Yuan Hua, Wei Shen, Binghai Wang, Yan Liu, Senjie Jin, Qin Liu, Yuhao Zhou, Limao Xiong, Lu Chen, Zhiheng Xi, Nuo Xu, Wenbin Lai, Minghao Zhu, Cheng Chang, Zhangyue Yin, Rongxiang Weng, Wensen Cheng, Haoran Huang, Tianxiang Sun, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Fudan NLP Group
Source & References: https://arxiv.org/abs/2307.04964v1
Introduction
Large Language Models (LLMs) have come a long way toward advancing artificial general intelligence (AGI). Their primary goal is to function as assistants that are helpful, honest, and harmless to humans. Reinforcement Learning with Human Feedback (RLHF) is a critical technological paradigm for making sure these models align with human values. However, RLHF still faces significant challenges, such as reward design, environment interaction, agent training, and the huge trial-and-error cost of experimenting with large language models, which makes it difficult for AI researchers to advance alignment work and deploy LLMs safely. In this paper, the authors dissect the RLHF framework, re-evaluate the inner workings of Proximal Policy Optimization (PPO), and propose the PPO-max algorithm to improve training stability.
Reward Modeling
To train the reward model, the authors use a dataset of paired comparisons between two responses generated for the same input, so that the model learns to score the preferred response higher than the rejected one. During RL training, a KL divergence penalty against the initial policy is added to the reward; it keeps the RL policy's output from deviating drastically from the samples the reward model encountered during training, while also serving as an entropy bonus that fosters exploration.
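To make the paired-comparison setup concrete, here is a minimal sketch of the standard pairwise ranking loss used for this kind of reward model. The `reward_model` callable and tensor shapes are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise ranking loss: the preferred ("chosen") response should
    receive a higher scalar score than the "rejected" one.

    reward_model is assumed to map token ids to one scalar per sequence.
    """
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```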
Reinforcement Learning
Applying RL to dialogue generation in large language models is challenging due to the sheer size of the state-action space. The paper walks through the application of policy gradient methods, which directly optimize the agent's policy rather than learning a value function alone. The authors rely on advantage functions to counter the high variance of policy gradient estimates, ultimately promoting stable learning.
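As an illustration of how such advantages are typically computed, the sketch below implements Generalized Advantage Estimation (GAE), a common variance-reduction choice in PPO-style training; the hyperparameters and trajectory shapes are assumptions for the example, not values taken from the paper.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.

    rewards: per-step rewards, shape (T,)
    values:  value estimates, shape (T + 1,) including a bootstrap value
    Returns advantage estimates of shape (T,).
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # One-step temporal-difference error
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of TD errors
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```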
PPO Algorithm Overview
The paper delves into the inner workings of the PPO algorithm, which is used to improve training stability when fine-tuning large language models. PPO-max is proposed as a refined version of the algorithm, incorporating carefully calibrated implementation choices that avoid interfering with one another while ensuring that the models can be better aligned with humans.
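For readers unfamiliar with PPO's core update, the clipped surrogate objective at its heart looks roughly like the sketch below. This is the generic PPO-clip loss rather than the paper's specific PPO-max implementation, and the function signature is an illustrative assumption.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (to be minimized).

    log_probs_new: log-probabilities of actions under the current policy
    log_probs_old: log-probabilities under the policy that collected the data
    advantages:    advantage estimates (e.g., from GAE)
    """
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the minimum so the update never benefits from pushing the ratio
    # outside the trust region defined by clip_eps
    return -torch.min(unclipped, clipped).mean()
```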
Experiments and Results
Through various experiments, the authors find that the quality of the reward model directly determines the upper bound of the policy model's performance, and that designing an appropriate PPO algorithm is crucial for successful RLHF training. Their proposed PPO-max algorithm demonstrates alignment performance comparable to ChatGPT on 7B and 13B SFT models.
Contributions and Open-Source Release
Eager to further the development of LLMs, the authors make three significant contributions. First, they release competitive Chinese and English reward models that exhibit good cross-model generalization ability, reducing the cost of relabeling human preference data. Second, they conduct an in-depth analysis of the inner workings of the PPO algorithm and propose the PPO-max algorithm to ensure stable model training. Third, they release the complete PPO-max code, enabling researchers to better align LLMs with humans.
Conclusion
In conclusion, the "Secrets of RLHF in Large Language Models Part I: PPO" paper delves deep into the challenges that reinforcement learning faces in training large language models and offers a novel, promising solution in the form of the PPO-max algorithm. This algorithm provides researchers with a robust solution to ensure the stability of RLHF training and effectively align LLMs with human values. The open-source release of the reward models and PPO-max codes aims to promote further investigation and development in the field of large language models' safety.
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
Authors: Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, Candace Ross, Adam Polyak, Russell Howes, Vasu Sharma, Puxin Xu, Hovhannes Tamoyan, Oron Ashual, Uriel Singer, Shang-Wen Li, Susan Zhang, Gargi Ghosh, Yaniv Taigman, Maryam Fazel-Zarandi, Asli Celikyilmaz, Luke Zettlemoyer, Armen Aghajanyan
Source & References: https://ai.meta.com/research/publications/scaling-autoregressive-multi-modal-models-pretraining-and-instruction-tuning/
Introduction
Researchers from the FAIR and YerevaNN teams recently introduced CM3Leon, a decoder-only multi-modal language model that is capable of generating and infilling both text and images. The model is built on the CM3 multi-modal architecture and demonstrates the benefits of scaling up and tuning on more diverse instruction-style data. This new model achieves state-of-the-art performance in text-to-image generation with significantly less training compute than current methods.