Greetings,
Welcome to the landmark 37th edition of the State of AI. This issue is a treasure trove of AI advancements, covering a spectrum from distributed inference and fine-tuning of large language models over the internet to pioneering techniques in audio, music, and speech generation.
We explore the cutting-edge "SwitchHead," which accelerates Transformers with a Mixture-of-Experts Attention mechanism, and delve into "FreeInit," a groundbreaking approach to bridging the initialization gap in video diffusion models. We also introduce "LLM360," a step towards fully transparent, open-source large language models, and showcase "Amphion," an open-source toolkit revolutionizing audio, music, and speech generation.
Each article in this edition is a testament to the ever-evolving and exciting world of AI, offering deep insights and sparking imagination. We hope you find these developments as fascinating as we do.
Best regards,
Contents
Distributed Inference and Fine-tuning of Large Language Models Over The Internet
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
FreeInit: Bridging Initialization Gap in Video Diffusion Models
LLM360: Towards Fully Transparent Open-Source LLMs
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit
Distributed Inference and Fine-tuning of Large Language Models Over The Internet
Authors: Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel
Source and References: https://arxiv.org/abs/2312.08361
Introduction: Democratizing Access to Powerful Language Models
In recent years, the natural language processing (NLP) community has made great strides thanks to large language models (LLMs) such as OpenAI's GPT-3. These models, with billions of parameters, offer state-of-the-art performance on a wide range of tasks but also require cutting-edge hardware to run, making them inaccessible for many researchers and developers.
In this paper, the authors propose a novel solution to this problem. They develop efficient methods for performing inference and fine-tuning of LLMs over the internet by distributing the computational load across multiple devices. This approach aims to make large-scale NLP more cost-effective and accessible, even when using consumer-grade hardware and networks.
The Need for Better Solutions for Inference and Fine-tuning
Current approaches to running large language models on resource-constrained hardware include model parallelism and parameter offloading. Though these help to varying extents, they run into bottlenecks with autoregressive generation, which requires caching the attention keys and values of Transformer-based models and, in the case of offloading, reloading the model's parameters for every generated token. This makes them inefficient for interactive use cases such as chatbots and search engines.
To derive a more efficient solution, the authors focus on two main workloads: inference and fine-tuning. They develop an algorithm for the distributed, fault-tolerant inference of LLMs with over 50 billion parameters on unreliable devices and an adaptation for fine-tuning with minimal parameter updates.
A Novel Algorithm: Distributed Inference with Fault-Tolerant Recovery
The authors propose a new algorithm for distributed LLM inference, where the work is completed by a swarm of servers, each holding a subset of the pretrained model's layers. Clients, responsible for running inference or fine-tuning jobs, delegate computations to these servers while only holding input and output embeddings.
The major challenge with this approach is ensuring fault tolerance, as servers could fail or experience network issues. The authors introduce "dual attention caches" to overcome this problem: a server-side cache that stores past attention keys and values, and a client-side cache that stores past inputs sent to a server. If a server fails, the client can use its cache to restore the server's state, allowing inference to continue despite hardware or network issues.
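As a rough illustration of that recovery scheme, here is a minimal, self-contained sketch with toy stand-ins for the servers and caches (not the authors' implementation): a client keeps a copy of every input it has sent to each server, and when a server fails, it replays those cached inputs to a replacement so that inference can continue.

```python
# Minimal sketch of client-side recovery (toy stand-ins, not the authors' code):
# the client caches every input it sends, so a replacement server can rebuild
# the lost attention state by re-processing those inputs.
import random

class LayerServer:
    """A toy 'server' holding a block of layers and a cache of past inputs."""
    def __init__(self, name):
        self.name = name
        self.cache = []                        # stands in for cached attention keys/values

    def forward(self, token_repr):
        self.cache.append(token_repr)          # server-side cache grows each step
        return token_repr + 1                  # placeholder for the real layer computation

class Client:
    def __init__(self, servers):
        self.servers = servers
        self.sent = {s.name: [] for s in servers}   # client-side cache of past inputs

    def step(self, token_repr):
        for i, server in enumerate(self.servers):
            try:
                if random.random() < 0.2:            # simulate a crash / network failure
                    raise ConnectionError(f"{server.name} is unreachable")
                out = server.forward(token_repr)
            except ConnectionError:
                server = self._replace(i)            # find a spare server for the same layers
                for past in self.sent[server.name]:  # replay cached inputs to rebuild its state
                    server.forward(past)
                out = server.forward(token_repr)
            self.sent[server.name].append(token_repr)
            token_repr = out
        return token_repr

    def _replace(self, i):
        spare = LayerServer(self.servers[i].name)    # fresh server for the same layer block
        self.servers[i] = spare
        return spare

client = Client([LayerServer("block_0"), LayerServer("block_1")])
state = 0
for _ in range(5):                                   # five autoregressive steps
    state = client.step(state)
print("final state:", state)
```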
Beyond fault tolerance, the algorithm also copes well with network latency: clients keep measurements of their latency to candidate servers and prioritize the best-connected ones when assigning work.
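A minimal sketch of what such latency-aware routing could look like follows; the measurement table and the selection rule here are assumptions for illustration, not the paper's exact policy.

```python
# Illustrative sketch (assumed details, not the paper's exact routing policy):
# the client tracks measured latency per candidate server and prefers the
# lowest-latency server that still serves the layers it needs.
measured_latency_ms = {            # hypothetical measurements kept by the client
    "server_a": {"layers": range(0, 40), "latency": 35.0},
    "server_b": {"layers": range(0, 40), "latency": 210.0},
    "server_c": {"layers": range(40, 80), "latency": 90.0},
}

def pick_server(layer, table):
    """Return the lowest-latency server that holds the requested layer."""
    candidates = [(info["latency"], name)
                  for name, info in table.items() if layer in info["layers"]]
    return min(candidates)[1]

print(pick_server(3, measured_latency_ms))    # -> "server_a"
```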
Extending the Algorithm to Support Fine-tuning
The authors also extend their work to support parameter-efficient fine-tuning techniques such as 'soft' prompts and adapters. By adding backpropagation to their distributed algorithm, they build a fault-tolerant fine-tuning system that should help democratize access to large-scale NLP training.
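To make the client-side view of this concrete, here is a small PyTorch sketch in which only a 'soft prompt' held by the client is trainable, while the transformer blocks (hosted remotely in the real system, simulated locally here) stay frozen and simply pass gradients back. This is an illustrative sketch under those assumptions, not the authors' code.

```python
# Rough sketch of client-side 'soft prompt' fine-tuning: only the prompt tensor
# is trainable; the (simulated) remote blocks are frozen and just pass gradients.
import torch
import torch.nn as nn

hidden = 64
remote_blocks = nn.Sequential(                 # stand-in for layers hosted on servers
    nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
)
for p in remote_blocks.parameters():
    p.requires_grad_(False)                    # server-side weights are never updated

soft_prompt = nn.Parameter(torch.randn(4, hidden) * 0.02)   # the only trainable parameters
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

inputs = torch.randn(8, hidden)                # toy token embeddings for one batch
targets = torch.randn(4 + 8, hidden)           # toy regression target, prompt rows included

for step in range(10):
    batch = torch.cat([soft_prompt, inputs], dim=0)   # prepend prompt to the sequence
    out = remote_blocks(batch)                        # forward through frozen remote layers
    loss = ((out - targets) ** 2).mean()
    loss.backward()                                   # gradients flow only into soft_prompt
    optimizer.step()
    optimizer.zero_grad()
print("final loss:", loss.item())
```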
Real-World Performance: Faster and More Efficient Inference
To test their proposed methods, the authors run experiments on real-world distributed infrastructures to evaluate the performance of their algorithms on LLMs like Llama 2 (70B) and BLOOM (176B). They conduct tests in both simulated network conditions and real-world setups spanning two continents.
These experiments show that their algorithms can perform autoregressive generation up to ten times faster than local parameter offloading, even on geographically distributed devices connected by consumer-grade networks.
Looking Ahead: Democratizing Access to Powerful NLP Tools
In conclusion, the authors have put forth a novel solution to one of the biggest challenges in large-scale NLP: giving more researchers and developers access to powerful language models regardless of their hardware or network constraints. The combination of a fault-tolerant, distributed inference algorithm and an extension for fine-tuning opens up new possibilities for cost-efficient, large-scale NLP tasks even on consumer-grade hardware.
The researchers have developed PETALS, a decentralized system based on their proposed algorithms, to showcase the potential for running large language models efficiently over the internet. The code and documentation for PETALS are publicly available, inviting the community to explore and harness the power of large language models in a more accessible and efficient way.
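For a sense of how this looks in practice, the snippet below is adapted from the examples in the public PETALS repository; the model name and exact API details may have changed since this writing.

```python
# Adapted from the public PETALS examples (requires: pip install petals).
# Model names and API details may differ between versions.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"       # any model served by the public swarm
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A quick test of distributed inference:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=20)   # transformer layers run on remote servers
print(tokenizer.decode(outputs[0]))
```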
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Authors: Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber
Source and References: https://arxiv.org/abs/2312.07987
Transformers - A High-Stakes Game
Transformers have had a profound impact on machine learning and natural language processing. The architecture underpins the best-known large language models, including OpenAI's GPT-3 and Google's BERT. However, despite their impressive performance, Transformers come at a steep price: their size and resource requirements make them too expensive for many researchers and organizations to run.
In the quest for more efficient models, a new paper by Róbert Csordás, Piotr Piękos, Kazuki Irie, and Jürgen Schmidhuber presents a method called SwitchHead. It promises to match the language modeling performance of standard Transformer models while significantly reducing both computation and memory requirements. Let's dive into how they achieved this and what it could mean for the future of Transformers.
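As a rough orientation before the details, the snippet below sketches the general flavour of a mixture-of-experts projection, where a learned router picks one expert weight matrix per token so only a fraction of the parameters is active at a time. This is an illustrative sketch of the broader technique only, not the authors' exact SwitchHead formulation.

```python
# Illustrative mixture-of-experts projection (assumed design, not SwitchHead itself):
# a learned router picks, per token, which expert weight matrix computes that
# token's projection, so only a fraction of the parameters is active per token.
import torch
import torch.nn as nn

class MoEProjection(nn.Module):
    def __init__(self, dim, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim, bias=False) for _ in range(n_experts))
        self.router = nn.Linear(dim, n_experts)        # scores each expert per token

    def forward(self, x):                              # x: (tokens, dim)
        choice = self.router(x).argmax(dim=-1)         # hard top-1 routing for simplicity
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])            # only the chosen expert runs per token
        return out

proj = MoEProjection(dim=32)
tokens = torch.randn(10, 32)
print(proj(tokens).shape)                              # torch.Size([10, 32])
```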