Greetings,
Welcome to the 26th edition of the State of AI. In this issue, we uncover the intricacies of QA-LoRA, a groundbreaking approach to adapting large language models. We explore the marvels of efficient streaming with attention sinks, dive deep into the Boolformer's symbolic regression capabilities, get inspired by DreamGaussian's pioneering methods for 3D content creation, and marvel at the versatility of AnyMAL, the any-modality augmented language model.
Each of these subjects shines a spotlight on the multifaceted progress and broadening horizons in the AI landscape. Your journey of discovery awaits. Enjoy!
Best regards,
Contents
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Efficient Streaming Language Models with Attention Sinks
Boolformer: Symbolic Regression of Logic Functions with Transformers
DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation
AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Authors: Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, Qi Tian
Source & References: https://arxiv.org/abs/2309.14717v1
Introduction
Large language models (LLMs) have become an increasingly dominant paradigm in natural language processing due to their state-of-the-art performance across a wide range of language understanding tasks. However, these models come with a significant downside: their deployment in real-world scenarios is hindered by their high computational and memory requirements.
To address this issue, researchers have proposed methods such as parameter-efficient fine-tuning (PEFT) and quantization. One popular PEFT method is Low-Rank Adaptation (LoRA), which fine-tunes small low-rank matrices that complement the frozen pre-trained weights. While LoRA is memory-efficient during fine-tuning, the adapted weights remain in full precision, so the deployed model stays computationally expensive. In this paper, the authors present Quantization-Aware Low-Rank Adaptation (QA-LoRA), an approach that combines PEFT and quantization to provide a more balanced solution.
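For readers less familiar with LoRA, here is a minimal sketch of the idea in PyTorch (illustrative only, not the authors' code; the rank and scaling values are arbitrary): the pre-trained weight is frozen and a small low-rank product B·A is trained on top of it.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: the frozen pre-trained weight W is complemented by a
    trainable low-rank product B @ A, scaled by alpha / rank."""
    def __init__(self, d_in: int, d_out: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                 # W stays frozen
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # low-rank down-projection
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # low-rank up-projection, init to 0
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scale * x A^T B^T  (only A and B receive gradients)
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T
```

Because B is initialized to zero, the adapted model starts out identical to the pre-trained one and only diverges as the low-rank factors are trained.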
The Challenges of Combining PEFT and Quantization
The main objective of QA-LoRA is to achieve efficient adaptation and deployment of LLMs, which requires computational efficiency during both the fine-tuning and inference stages. While LoRA provides parameter-efficient fine-tuning, it still incurs significant memory usage, especially for larger models. Quantization, on the other hand, can reduce the memory footprint and computational cost without retraining the LLM, but it often comes with unsatisfactory accuracy, especially when the quantization bit width is low.
Integrating PEFT and quantization poses additional challenges, such as propagating gradients through discrete values and optimizing quantization parameters. A naive solution, like performing post-training quantization (PTQ) after PEFT, often results in a significant loss in accuracy.
The QA-LoRA Approach
The key insight behind QA-LoRA is the balance between the degrees of freedom for quantization and adaptation. This is achieved by introducing group-wise operators, which increase the degrees of freedom for low-bit quantization while decreasing those for LoRA.
The authors implement this approach by partitioning each column of the pre-trained weight matrix into multiple groups, with an individual pair of scaling and zero factors used per group. This finer-grained control allows the quantized weights and the low-rank weights to be merged without resorting to high-precision numbers (e.g., FP16). Furthermore, QA-LoRA reduces the number of parameters in the auxiliary matrix A, since it can be set to an L × D_int matrix, where L is the number of groups, without further constraints.
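To make the group-wise idea concrete, here is a rough sketch (not the paper's implementation; the min-max quantizer, mean pooling over groups, and function names are assumptions): each column of W is split into groups along the input dimension, each group gets its own scale and zero factor, and because the LoRA contribution is constant within a group, it can be folded into the per-group zero factors after fine-tuning, so the deployed weights stay in low-bit integers.

```python
import torch

def quantize_groupwise(W: torch.Tensor, group_size: int, n_bits: int = 4):
    """Group-wise min-max quantization (sketch): each column of W is split into
    groups of `group_size` rows; every group gets its own scale and zero factor."""
    d_in, d_out = W.shape
    assert d_in % group_size == 0
    L = d_in // group_size
    Wg = W.reshape(L, group_size, d_out)            # (groups, rows-in-group, cols)
    w_min = Wg.min(dim=1, keepdim=True).values      # per-group minimum
    w_max = Wg.max(dim=1, keepdim=True).values      # per-group maximum
    scale = (w_max - w_min) / (2 ** n_bits - 1)     # per-group scaling factor
    zero = w_min                                    # per-group zero factor
    q = torch.clamp(torch.round((Wg - zero) / scale), 0, 2 ** n_bits - 1)
    return q.to(torch.uint8), scale, zero           # low-bit codes stored in uint8

def merge_lora_into_zeros(zero, A, B, s, group_size):
    """QA-LoRA-style merge (sketch): if the LoRA branch sees the input mean-pooled
    per group, s * B @ A adds one constant per (group, output) pair, which can be
    folded into the per-group zero factors -- no FP16 weight materialization."""
    # A: (rank, L), B: (d_out, rank)  ->  per-(group, output) correction: (L, d_out)
    correction = s * (B @ A).T / group_size
    # Dequantized weight is scale * q + zero, so shifting zero shifts all rows in a group.
    return zero + correction.unsqueeze(1)           # broadcast over rows in each group
```

The essential trick is that both quantization and the (constrained) adapter operate at the granularity of groups, so they compose without leaving the low-bit representation.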
QA-LoRA outperforms existing methods like QLoRA, which quantizes the pre-trained weights into NF4 and adds LoRA. QLoRA falls short on two fronts: 1) the weights revert to FP16 after fine-tuning, so the deployed model is still slow, and 2) there is no operator-level optimization for NF4 yet, which hinders acceleration during both the fine-tuning and inference stages.
With QA-LoRA, two primary benefits are achieved: 1) an efficient fine-tuning stage thanks to the quantized LLM weights, and 2) a lightweight, fine-tuned model without the need for PTQ, which often incurs a loss of accuracy.
Experiments and Results
The effectiveness of QA-LoRA is demonstrated through extensive experimentation on LLaMA and LLaMA2 model families, which are fine-tuned on various language understanding benchmarks. The authors show that QA-LoRA consistently outperforms QLoRA with PTQ across all LLM scales.
One of the main advantages of QA-LoRA is that it does not suffer an accuracy loss due to PTQ. It exhibits equally good, if not better, performance compared to QLoRA without PTQ, making it an effective off-the-shelf method for the joint quantization and adaptation of LLMs.
Conclusion
In summary, the authors present QA-LoRA, a simple yet effective method for jointly incorporating quantization and low-rank adaptation in large language models. By introducing group-wise operators, QA-LoRA balances the degrees of freedom for quantization and adaptation, resulting in both an efficient fine-tuning stage and a fast, accurate deployed model.
QA-LoRA provides a viable solution for deploying large language models on edge devices, as it is easily implemented and generalizable across different scenarios. With its computational efficiency and promising performance, QA-LoRA can pave the way for more extensive adoption and application of large language models in real-world scenarios.
Efficient Streaming Language Models with Attention Sinks
Authors: Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
Source & References: https://arxiv.org/abs/2309.17453
Introduction
Large Language Models (LLMs) have transformed how we interact with technology, powering numerous applications such as chatbots, document summarization, code completion, and question-answering. A critical challenge for these models is to efficiently handle inputs with infinite length (streaming applications), like a chatbot that converses throughout the day.
The paper "Efficient Streaming Language Models with Attention Sinks" introduces a framework called StreamingLLM. It enables LLMs trained with a finite attention window to maintain high performance on effectively infinite sequence lengths without additional fine-tuning, making them suitable for streaming applications.
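At a high level, StreamingLLM keeps the key/value states of the first few tokens (the "attention sinks") plus a sliding window of the most recent tokens, and evicts everything in between. Here is a minimal cache-eviction sketch (illustrative Python, not the authors' implementation; `n_sink` and `window` are assumed values):

```python
from collections import deque

class SinkKVCache:
    """Sketch of a StreamingLLM-style cache: always retain the first `n_sink`
    entries as attention sinks, plus a sliding window of the most recent entries."""
    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sinks = []                      # KV pairs of the earliest tokens, never evicted
        self.recent = deque(maxlen=window)   # rolling window of recent KV pairs

    def append(self, kv):
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)    # the very first tokens become permanent sinks
        else:
            self.recent.append(kv)   # older non-sink entries fall out automatically

    def view(self):
        # The model attends over sinks + recent window, so cache size stays bounded.
        return self.sinks + list(self.recent)
```

The paper also handles positional information relative to positions inside the cache rather than positions in the original text, which is what lets the bounded cache generalize to arbitrarily long streams.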