Greetings,
Welcome to the 32nd issue of State of AI, where we explore groundbreaking advancements shaping the future of artificial intelligence. This issue covers a spectrum of innovations: from S-LoRA's capability to manage thousands of concurrent LoRA adapters, to Levels of AGI, which offers a framework for operationalizing progress towards Artificial General Intelligence (AGI).
We also delve into OtterHD, a high-resolution multi-modality model, and LLaVA-Plus, which learns to use tools to create multimodal agents. Lastly, we introduce mPLUG-Owl2, which advances multi-modal large language models through modality collaboration.
Each topic in this edition not only signifies a leap in AI technology but also provides a window into the future, where AI’s potential is boundless. We invite you to immerse yourself in these fascinating developments. Enjoy the read!
Best regards,
Contents
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Levels of AGI: Operationalizing Progress on the Path to AGI
OtterHD: A High-Resolution Multi-modality Model
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Authors: Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica
Source & References: https://arxiv.org/abs/2311.03285
Introduction
Large Language Models (LLMs) have become a crucial component of modern applications, ranging from natural language processing to more general tasks. In recent years, the "pretrain-then-finetune" paradigm has become the standard approach to deploying these models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often used to adapt a base model to multiple tasks, resulting in a substantial collection of LoRA adapters derived from a single base model.
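For readers unfamiliar with LoRA itself, here is a minimal sketch of the underlying computation: the adapter adds a low-rank update xAB on top of the frozen base projection xW. The function name and shapes below are illustrative, not taken from the S-LoRA codebase.

```python
import torch

def lora_forward(x, W, A, B, scaling=1.0):
    """Illustrative LoRA forward pass in row-vector convention: y = xW + scaling * xAB.

    x: (batch, h_in) activations
    W: (h_in, h_out) frozen base weight
    A: (h_in, r), B: (r, h_out) low-rank adapter weights, with r << h_in, h_out
    """
    base = x @ W                 # frozen base-model projection
    delta = (x @ A) @ B          # low-rank adapter update; cheap because r is small
    return base + scaling * delta

# Example: a rank-16 adapter on a 4096-wide projection
x = torch.randn(8, 4096)
W = torch.randn(4096, 4096)
A = torch.randn(4096, 16) * 0.01
B = torch.zeros(16, 4096)        # LoRA commonly initializes B to zero
y = lora_forward(x, W, A, B)     # shape (8, 4096)
```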
The researchers introduce S-LoRA, a system designed to scalably serve many LoRA adapters. By storing all adapters in the main memory and fetching the adapters used by the currently running queries to the GPU memory, S-LoRA enables scalable serving of many fine-tuned models and offers the potential for large-scale customized fine-tuning services.
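To make the "keep everything in host memory, page in on demand" idea concrete, here is a small sketch. The class and method names are our own illustration, not S-LoRA's actual API.

```python
import torch

class AdapterStore:
    """Sketch of keeping all adapters in host memory and paging only the
    adapters used by the running batch onto the GPU (illustrative API,
    not S-LoRA's actual implementation)."""

    def __init__(self):
        self.cpu_adapters = {}   # adapter_id -> (A, B) in pinned host memory
        self.gpu_adapters = {}   # adapter_id -> (A, B) currently resident on the GPU

    def register(self, adapter_id, A, B):
        self.cpu_adapters[adapter_id] = (A.pin_memory(), B.pin_memory())

    def prefetch(self, active_ids, device="cuda"):
        # Evict adapters no longer referenced by the running batch...
        for adapter_id in list(self.gpu_adapters):
            if adapter_id not in active_ids:
                del self.gpu_adapters[adapter_id]
        # ...and asynchronously copy in the ones the batch needs.
        for adapter_id in active_ids:
            if adapter_id not in self.gpu_adapters:
                A, B = self.cpu_adapters[adapter_id]
                self.gpu_adapters[adapter_id] = (
                    A.to(device, non_blocking=True),
                    B.to(device, non_blocking=True),
                )
        return self.gpu_adapters
```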
Batching and Scheduling
The batching strategy in S-LoRA separates the computation of the base model from that of the LoRA adapters. It optimizes serving throughput by using custom CUDA kernels to execute the additional LoRA computation xAB for all adapters, separately from the batched base-model computation xW. This reduces GPU memory usage and enables higher throughput.
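The split between the dense base computation and the per-adapter low-rank updates can be illustrated as follows; S-LoRA runs the adapter part in fused custom CUDA kernels, whereas this sketch spells out the math with a plain Python loop.

```python
import torch

def batched_lora_step(x, W, adapter_ids, adapters, scaling=1.0):
    """Sketch of the separated computation: one dense GEMM xW for the whole
    batch, plus per-adapter low-rank updates xAB (fused into custom kernels
    in S-LoRA; the loop here is only for clarity)."""
    out = x @ W                                       # base model: a single batched GEMM
    for adapter_id in set(adapter_ids):
        rows = [i for i, a in enumerate(adapter_ids) if a == adapter_id]
        A, B = adapters[adapter_id]
        out[rows] = out[rows] + scaling * ((x[rows] @ A) @ B)  # heterogeneous adapter updates
    return out

# Example: a batch of 4 requests spread across 2 different adapters
h, r = 512, 8
W = torch.randn(h, h)
adapters = {"lora-a": (torch.randn(h, r) * 0.01, torch.zeros(r, h)),
            "lora-b": (torch.randn(h, r) * 0.01, torch.zeros(r, h))}
x = torch.randn(4, h)
y = batched_lora_step(x, W, ["lora-a", "lora-b", "lora-a", "lora-b"], adapters)
```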
To reduce the number of active adapters in a running batch and thereby increase the batch size, S-LoRA uses an "adapter clustering" strategy. Clustering adapters, however, can hurt average latency or fairness among adapters. In addition, S-LoRA uses an early-abort strategy for admission control to meet latency targets when processing requests, as sketched below.
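A simplified view of the early-abort idea; the exact policy and cost model in the paper differ in detail, and the names below are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Request:
    arrival_time: float
    adapter_id: str
    est_runtime: float    # estimated time to process this request

def admit(queue, now, latency_slo):
    """Illustrative early-abort admission control: a request that can no longer
    finish within its latency target is aborted up front instead of occupying
    batch capacity."""
    admitted, aborted = [], []
    for req in sorted(queue, key=lambda r: r.arrival_time):
        if now + req.est_runtime <= req.arrival_time + latency_slo:
            admitted.append(req)
        else:
            aborted.append(req)
    return admitted, aborted

# Example: two requests, one of which can no longer meet a 1-second SLO
queue = [Request(0.0, "lora-a", 0.3), Request(0.2, "lora-b", 0.9)]
admitted, aborted = admit(queue, now=0.5, latency_slo=1.0)
```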
Memory Management
Serving many LoRA adapters simultaneously creates new memory-management challenges. S-LoRA proposes Unified Paging, which uses a single unified memory pool to store both the KV caches and the adapter weights in a paged fashion. This reduces fragmentation, enables larger batch sizes, and keeps latency overhead low.
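Conceptually, Unified Paging behaves like a single pool of fixed-size pages that either KV-cache blocks or adapter weights can occupy. The sketch below illustrates that idea under our own assumptions about page size and layout; it is not the system's real allocator.

```python
import torch

class UnifiedPool:
    """Sketch of a unified, paged memory pool: one reserved buffer divided into
    fixed-size pages that may hold either KV-cache entries or adapter weights,
    so both share free space and fragmentation stays low."""

    def __init__(self, num_pages, page_size, device="cpu"):
        self.pages = torch.empty(num_pages, page_size, device=device)
        self.free = list(range(num_pages))
        self.owner = {}    # page index -> ("kv", seq_id) or ("adapter", adapter_id)

    def alloc(self, n, owner):
        if len(self.free) < n:
            raise MemoryError("pool exhausted: evict sequences/adapters or shrink the batch")
        taken = [self.free.pop() for _ in range(n)]
        for p in taken:
            self.owner[p] = owner
        return taken       # page indices; the caller writes KV or adapter data into self.pages[p]

    def release(self, pages):
        for p in pages:
            self.owner.pop(p, None)
            self.free.append(p)

# Example: KV-cache pages and adapter pages drawn from the same pool
pool = UnifiedPool(num_pages=1024, page_size=4096)
kv_pages = pool.alloc(8, ("kv", "seq-42"))
adapter_pages = pool.alloc(4, ("adapter", "lora-7"))
pool.release(kv_pages)    # freed pages can now back either kind of allocation
```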
Tensor Parallelism
To efficiently parallelize across multiple GPUs, S-LoRA introduces a novel tensor parallelism strategy that minimizes communication cost for the added LoRA computation compared to the base model. This is achieved by scheduling communications on small intermediate tensors and aligning them with the base model's communication operations.
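The benefit comes from the fact that the LoRA intermediate xA has only r columns, so collectives over it are tiny compared with those over the hidden dimension. A back-of-the-envelope comparison with illustrative numbers (not the paper's exact cost model):

```python
def comm_volume(batch_tokens, hidden, rank):
    """Rough comparison: per layer, collectives over base-model activations move
    on the order of batch_tokens * hidden elements, while collectives over the
    small LoRA intermediate xA move only batch_tokens * rank elements."""
    base = batch_tokens * hidden
    lora = batch_tokens * rank
    return base, lora

base, lora = comm_volume(batch_tokens=2048, hidden=8192, rank=16)
print(f"LoRA adds roughly {100 * lora / base:.2f}% extra communication volume")  # ~0.20%
```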
Performance Evaluation
S-LoRA is evaluated on several Llama models and shows a significant improvement in throughput and the number of served adapters compared to other state-of-the-art libraries. With a small overhead, S-LoRA can serve thousands of LoRA adapters on a single GPU or across multiple GPUs, outperforming other libraries both in terms of throughput and memory management.
Conclusion
S-LoRA effectively addresses the challenges of serving thousands of concurrent LoRA adapters by utilizing a unique batching strategy, an efficient memory management system, and a novel tensor parallelism strategy. The system significantly outperforms existing solutions and is highly applicable to large-scale customized fine-tuning services. The code for S-LoRA is openly available on GitHub, paving the way for broader adoption and use in practical applications.
By introducing S-LoRA, the research proposes a practical and scalable solution to the challenges of serving numerous fine-tuned models simultaneously. It's a promising innovation that can drastically improve the deployment of Large Language Models in real-world use cases, offering great benefits to both users and developers.
Levels of AGI: Operationalizing Progress on the Path to AGI
Authors: Meredith Ringel Morris, Jascha Sohl-Dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, Shane Legg
Source & References: https://arxiv.org/abs/2311.02462v1
Introduction
The concept of Artificial General Intelligence (AGI) has advanced from being a philosophical debate to a topic with practical relevance due to the rapid progress of Machine Learning (ML) models. AGI refers to an AI system that is as capable as a human at most tasks. Researchers have proposed various definitions, benchmarks, and frameworks for AGI, including timelines and expected characteristics.
This paper presents a framework for classifying AGI models and their precursors by introducing levels of performance, generality, and autonomy. The framework focuses on the path to AGI, facilitating comparisons between models, risk assessment, and progress measurement.
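As a rough illustration, the performance and generality axes of the paper's taxonomy can be written down as simple enumerations; the level names follow the paper, while the comments paraphrase its percentile-based definitions.

```python
from enum import Enum

class PerformanceLevel(Enum):
    """Performance axis of the taxonomy (level names follow the paper)."""
    NO_AI = 0
    EMERGING = 1      # equal to or somewhat better than an unskilled human
    COMPETENT = 2     # at least 50th percentile of skilled adults
    EXPERT = 3        # at least 90th percentile of skilled adults
    VIRTUOSO = 4      # at least 99th percentile of skilled adults
    SUPERHUMAN = 5    # outperforms all humans

class Generality(Enum):
    NARROW = "narrow"     # a clearly scoped task or set of tasks
    GENERAL = "general"   # a wide range of non-physical tasks, including metacognitive ones

# A system is rated at the intersection of the two axes; the paper argues, for
# example, that today's frontier LLMs sit around (EMERGING, GENERAL).
```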