Welcome to State of AI Pulse: The Single Most Noteworthy Paper, Summarized in Each Issue!
LLM in a flash: Efficient Inference with Limited Memory by Apple
Introducing State of AI Pulse
Welcome to the first issue of "State of AI Pulse." Our goal is simple: each issue breaks down one important AI or machine learning paper. We aim to provide clear, in-depth analysis so that our readers, whether they're professionals, academics, or enthusiasts, can easily understand key developments in the field. In this issue, we start with a new paper from Apple that proposes a way around the memory limitations of current LLMs. Let's dive in.
LLM in a Flash: Efficient Large Language Model Inference with Limited Memory
Authors: Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar
Source and references: https://arxiv.org/abs/2312.11514
Flash and the Big Language Model
As the age of large language models (LLMs) dawns, models like GPT-3, OPT, and PaLM are delivering breakthrough performance on natural language processing tasks. However, their enormous size and computational requirements present roadblocks, especially for devices with limited DRAM capacity.
To overcome these limitations, the authors of this paper propose storing model parameters on flash memory and loading only the essential parameters into DRAM on demand. This approach allows devices to run models up to twice the size of their available DRAM while significantly boosting inference speed.
Why Flash Memory is the Key
Using flash memory isn't without its challenges. Flash storage offers high capacity but much lower bandwidth and higher latency than DRAM. Furthermore, random read throughput drops sharply when data is fetched in small chunks.
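To build intuition for that chunk-size effect, here is a minimal benchmark sketch (ours, not the paper's): it times random reads of a given size against a weights file on disk. The file name `weights.bin` and the chunk sizes are hypothetical, and OS page caching will inflate the numbers on repeated runs.

```python
import os
import time

def random_read_throughput(path, chunk_size, num_reads=256):
    """Time `num_reads` random reads of `chunk_size` bytes and return MB/s."""
    file_size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    offsets = [int.from_bytes(os.urandom(4), "little") % max(1, file_size - chunk_size)
               for _ in range(num_reads)]
    start = time.perf_counter()
    for offset in offsets:
        os.pread(fd, chunk_size, offset)   # one random read per chunk
    elapsed = time.perf_counter() - start
    os.close(fd)
    return num_reads * chunk_size / elapsed / 1e6

# Larger chunks generally yield far higher effective throughput on flash:
# for kib in (4, 32, 256, 2048):
#     print(kib, "KiB:", round(random_read_throughput("weights.bin", kib * 1024)), "MB/s")
```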
The researchers propose two complementary techniques to address this issue:
1. Reducing the amount of data transferred between flash and DRAM by leveraging the high sparsity found in LLMs' Feed-Forward Network (FFN) layers.
2. Increasing data chunk sizes to improve flash memory throughput.
By exploiting sparsity and employing efficient memory management strategies, the authors can load only around 2% of the FFN layer from flash for each inference query.
Harnessing the Power of Sparsity
The core of this paper's contribution lies in its use of sparsity. Models like OPT exhibit roughly 90% sparsity in their FFN activations, allowing for selective loading of only the necessary parameters from flash memory to DRAM during inference.
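As a rough illustration of what selective loading looks like, the sketch below assumes the up-projection weights are stored one row per neuron in a flash-resident file (exposed here through `np.memmap`) and that a predictor has already produced a boolean mask of likely-active neurons. The file name, toy dimensions, and function name are our own assumptions, not the paper's implementation.

```python
import numpy as np

# Toy dimensions for one FFN layer: d_model inputs, d_ffn hidden neurons.
d_model, d_ffn = 4096, 16384

# Hypothetical flash-resident up-projection, stored one row per neuron.
# np.memmap stands in for weights that are only paged in when actually read.
up_proj = np.memmap("up_proj.bin", dtype=np.float16, mode="r",
                    shape=(d_ffn, d_model))

def ffn_up_sparse(x, predicted_active):
    """Apply the up projection using only the neurons predicted to be active.

    `predicted_active` is a boolean mask over the d_ffn neurons; the paper trains
    a small low-rank predictor to produce it, here it is simply an input.
    """
    idx = np.flatnonzero(predicted_active)   # e.g. ~10% of neurons at 90% sparsity
    rows = np.asarray(up_proj[idx])          # only these rows are read from "flash"
    hidden = np.maximum(rows @ x, 0.0)       # ReLU, consistent with the sparsity assumption
    return idx, hidden
```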
The authors also employ a sliding window technique to manage neuron data. This approach stores neuron data for only a recent subset of input tokens in memory, freeing up memory resources previously allocated to older, no-longer-needed tokens. As the number of tokens in the window increases, the volume of data loaded for each new token decreases, optimizing memory usage within DRAM capacity constraints.
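The windowing bookkeeping itself fits in a few lines. The class below is a simplified sketch of the idea, assuming flash-resident neuron rows are addressable by id (for example via `np.memmap`) and that a predictor supplies the set of active neuron ids for each token; the class and field names are hypothetical, not the paper's.

```python
import numpy as np

class SlidingWindowNeuronCache:
    """Hold FFN neuron rows for the neurons active in the last `window` tokens.

    Only neurons newly needed for the current token are read from flash;
    neurons unused by every token still in the window are evicted. This is a
    simplified sketch of the windowing idea, not the paper's exact bookkeeping.
    """
    def __init__(self, flash_weights, window=5):
        self.flash = flash_weights          # e.g. an np.memmap over flash-resident rows
        self.window = window                # number of recent tokens to keep neurons for
        self.history = []                   # per-token sets of active neuron ids
        self.cache = {}                     # neuron id -> row currently held in DRAM

    def step(self, active_ids):
        active_ids = {int(i) for i in active_ids}
        # Load only the incremental delta: neurons not already resident in DRAM.
        for nid in active_ids.difference(self.cache):
            self.cache[nid] = np.asarray(self.flash[nid])
        # Slide the window forward and drop neurons no token in it still needs.
        self.history.append(active_ids)
        if len(self.history) > self.window:
            self.history.pop(0)
        still_needed = set().union(*self.history)
        for nid in list(self.cache):
            if nid not in still_needed:
                del self.cache[nid]
        return {nid: self.cache[nid] for nid in active_ids}
```

With a larger window, more neurons are already resident when a new token arrives, so less data has to be read from flash per token, at the cost of holding more rows in DRAM.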
Optimizing Data Chunk Sizes
To further improve flash memory throughput, the researchers propose bundling rows and columns of the upward and downward projection layers. By storing each neuron's up-projection column and down-projection row together in flash memory, the data for that neuron can be fetched in a single, larger read, and bigger chunks translate directly into higher effective flash throughput.
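As an illustration, the sketch below bundles a standard two-matrix FFN (hidden = relu(x @ W_up), out = hidden @ W_down) so that column i of the up projection sits next to row i of the down projection on disk. The toy dimensions, file name, and layout are our own assumptions rather than the paper's implementation.

```python
import numpy as np

# Toy FFN dimensions so the sketch runs quickly; real models are far larger.
d_model, d_ffn = 256, 1024
W_up = np.random.randn(d_model, d_ffn).astype(np.float16)    # hidden = relu(x @ W_up)
W_down = np.random.randn(d_ffn, d_model).astype(np.float16)  # out = hidden @ W_down

# Row-column bundling: for neuron i, store column i of the up projection next to
# row i of the down projection, so one contiguous flash read returns both.
bundled = np.concatenate([W_up.T, W_down], axis=1)            # shape (d_ffn, 2 * d_model)
bundled.tofile("bundled_ffn.bin")                             # hypothetical flash layout

# At inference time a single, twice-as-large chunk is read per active neuron:
flash = np.memmap("bundled_ffn.bin", dtype=np.float16, mode="r",
                  shape=(d_ffn, 2 * d_model))
chunk = np.asarray(flash[42])                                 # one read for neuron 42
up_col, down_row = chunk[:d_model], chunk[d_model:]
```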
Though initial attempts to bundle neurons based on co-activation fell short, the authors present more effective co-activation-based bundling as a promising avenue for future research.
Efficient DRAM Data Management
To further optimize memory-centered performance, the authors suggest preallocating DRAM and managing it with a dedicated data structure. This structure includes elements such as pointers, a preallocated weight matrix, and bias values, reducing allocation overhead and avoiding costly copies within DRAM.
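As a rough picture of what such a structure might look like, here is a minimal sketch: a fixed-capacity weight matrix and bias array, an id-to-slot map, and a counter of rows in use, where deleting a neuron swaps the last used row into the freed slot so nothing is ever reallocated or shifted. The class and field names are hypothetical, chosen only to mirror the elements the paper mentions.

```python
import numpy as np

class PreallocatedNeuronBuffer:
    """Fixed-capacity DRAM buffer for the weights and biases of active FFN neurons.

    Rows are written into a preallocated matrix; deleting a neuron overwrites its
    row with the last used row, so no reallocation or large copy ever happens.
    This is a simplified sketch, not the paper's exact data structure.
    """
    def __init__(self, capacity, d_model, dtype=np.float16):
        self.matrix = np.empty((capacity, d_model), dtype=dtype)  # weight rows
        self.bias = np.empty(capacity, dtype=dtype)               # one bias per neuron
        self.neuron_ids = np.full(capacity, -1, dtype=np.int64)   # slot -> neuron id
        self.slot_of = {}                                         # neuron id -> slot
        self.num_used = 0                                         # rows currently valid

    def insert(self, neuron_id, row, bias):
        slot = self.num_used                    # always append into the next free slot
        self.matrix[slot] = row
        self.bias[slot] = bias
        self.neuron_ids[slot] = neuron_id
        self.slot_of[neuron_id] = slot
        self.num_used += 1

    def delete(self, neuron_id):
        slot = self.slot_of.pop(neuron_id)
        last = self.num_used - 1
        if slot != last:                        # move the last row into the freed slot
            self.matrix[slot] = self.matrix[last]
            self.bias[slot] = self.bias[last]
            moved = int(self.neuron_ids[last])
            self.neuron_ids[slot] = moved
            self.slot_of[moved] = slot
        self.num_used = last
```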
By combining sparsity-focused techniques with hardware-conscious design and context-adaptive loading, the authors have paved the way for efficient LLM inference on limited-memory devices.
The Flashy Results
Applying these techniques enables running models up to twice the size of the available DRAM, with a 4-5x increase in inference speed on CPU and a 20-25x increase on GPU relative to naively loading parameters from flash as they are needed. This is a significant achievement given that the prevailing practice is to load the entire model into memory for inference.
The authors successfully tackle the challenge of running large language models exceeding DRAM capacity by exploiting sparsity, intelligently managing memory, and optimizing data transfer. With this ground-breaking work, they demonstrate that efficient LLM inference is possible even on devices with limited memory, making these powerful models accessible to a broader range of applications and devices.
Thank you for reading State of AI Pulse!
Subscribe now and get free premium access for a limited time!