Welcome to the 43rd edition of the State of AI. This issue is packed with cutting-edge insights, from SliceGPT's innovative approach to compressing large language models, to the pioneering DeepSeek-Coder, which marries large language models with programming for stronger code intelligence. We also survey recent advances in multimodal large language models (MLLMs), dive into the Lumiere model for video generation, and assess the evolving MLLM landscape from GPT-4 to Gemini, focusing on generalizability, trustworthiness, and causality. Prepare for an enlightening journey through the forefront of AI development!
Best regards,
Contents
SliceGPT: Compress Large Language Models by Deleting Rows and Columns
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
MM-LLMs: Recent Advances in MultiModal Large Language Models
Lumiere: A Space-Time Diffusion Model for Video Generation
From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities
SliceGPT: Compress Large Language Models by Deleting Rows and Columns
Authors: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman
Source and references: https://arxiv.org/abs/2401.15024
The Dilemma of Large Language Models
Large language models (LLMs) have become the cornerstone of natural language processing, offering impressive performance in tasks like translation, sentiment analysis, and content generation. However, these powerful models come with a cost: they require massive compute and memory resources to operate. As a result, researchers have focused on finding ways to compress these models without sacrificing their performance, leading to the development of various model compression methods, such as distillation, tensor decomposition, pruning, and quantization.
The authors of SliceGPT propose a new post-training sparsification scheme designed explicitly for LLMs, aiming to maintain strong performance while reducing the model's size and, consequently, lowering compute and memory demands.
Introducing the SliceGPT Approach
Unlike traditional pruning methods that set some elements of a model's weight matrices to zero, the authors of SliceGPT suggest deleting entire rows or columns of these matrices instead. To achieve this, they introduce a concept called computational invariance: each weight matrix in a transformer network can be multiplied by an orthogonal matrix, with the inverse transformation absorbed into the adjacent weights, without changing the network's output. Exploiting this invariance, they choose orthogonal matrices that project the signal matrix passed between blocks onto its principal components.
The key steps in the SliceGPT procedure are to apply these transformations and then delete the rows or columns of the transformed matrices that correspond to the least significant principal components. The result is that the weight matrices become smaller, and the signals passed between blocks of the neural network shrink too – effectively reducing the embedding dimension of the network.
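To make this concrete, here is a minimal NumPy sketch of the rotate-then-slice idea applied to a single weight matrix. The matrix sizes, the calibration signal, and the 25% slicing ratio are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

# Illustrative sketch only: X stands in for a calibration signal matrix
# (tokens x embedding dim) and W for a weight matrix consuming that signal.
rng = np.random.default_rng(0)
d_embed, d_out, n_tokens = 512, 512, 2048
X = rng.standard_normal((n_tokens, d_embed))
W = rng.standard_normal((d_embed, d_out))

# 1. Build an orthogonal matrix Q from the principal components of the signal:
#    eigenvectors of X^T X, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
Q = eigvecs[:, order]

# 2. Computational invariance: rotating the signal by Q and the weights by Q^T
#    leaves the product (and hence the layer's output) unchanged.
assert np.allclose(X @ W, (X @ Q) @ (Q.T @ W))

# 3. Slice: keep only the top-k principal directions, deleting the rest.
#    This shrinks the embedding dimension from d_embed to k.
k = int(0.75 * d_embed)          # e.g. 25% slicing (hypothetical ratio)
X_sliced = X @ Q[:, :k]          # smaller signal passed between blocks
W_sliced = Q[:, :k].T @ W        # smaller weight matrix

print(W.shape, "->", W_sliced.shape)   # (512, 512) -> (384, 512)
```

The product X_sliced @ W_sliced only approximates the original X @ W, but because the deleted directions carry the least signal variance, the approximation error stays small.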
A Series of Successful Experiments
The SliceGPT researchers tested their method on a variety of LLMs, including OPT, LLAMA-2, and Phi-2 models. They found that their approach effectively compressed these models by up to 30% while maintaining competitive performance on various language tasks. For example, when sliced by up to 25%, the OPT, LLAMA-2, and Phi-2 models retained 99%, 99%, and 90% of their dense counterparts' zero-shot task performance, respectively.
The sliced models also ran on fewer GPUs and operated faster than their dense counterparts without any additional code optimization. On 24GB consumer GPUs, the total compute for inference on LLAMA-2 70B was reduced to 64% of that of the dense model, and on 40GB A100 GPUs it was reduced to 66%.
These impressive results indicate that SliceGPT is a promising technique for reducing the memory and computation demands of pre-trained LLMs in real-world applications.
Comparing to Existing Compression Techniques
The concept of deleting rows and columns to compress models is not entirely new, as similar techniques have been applied in the realm of convolutional neural networks (CNNs). However, many of these techniques require extensive fine-tuning and retraining of the models, which can be impractical when dealing with LLMs that have tens of billions of parameters.
SliceGPT stands out from the pack by eliminating the need for recovery fine-tuning, allowing the model to achieve competitive performance immediately after compression. Moreover, unlike low-rank approximation techniques that replace each weight matrix with the product of two smaller matrices, SliceGPT simplifies model compression by replacing each weight matrix with a single smaller one.
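To illustrate that structural difference, here is a small NumPy sketch comparing the two compression layouts; the hidden size, rank, and slicing ratio are hypothetical values chosen only for the comparison.

```python
import numpy as np

d = 4096  # hypothetical hidden size

W = np.zeros((d, d))                       # original dense weight

# Low-rank approximation: W is replaced by the product of two smaller
# matrices, so inference needs two matrix multiplications per layer.
r = 1024                                   # hypothetical rank
A, B = np.zeros((d, r)), np.zeros((r, d))  # W ~ A @ B, two matrices kept
lowrank_params = A.size + B.size

# SliceGPT-style slicing: W is replaced by a single smaller matrix,
# so the layer still needs only one matrix multiplication.
k = int(0.75 * d)                          # 25% of rows deleted
W_sliced = np.zeros((k, d))                # one matrix kept
sliced_params = W_sliced.size

print(f"dense: {W.size:,}  low-rank: {lowrank_params:,}  sliced: {sliced_params:,}")
```

Keeping a single (smaller) matrix per layer means the sliced model runs through the same code path as the original network, just with reduced dimensions.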
The Growing Need for LLM Compression
The increasing prevalence of large language models in various applications highlights the importance of finding efficient and effective ways to compress these models. As data volume and model complexity continue to grow, researchers must find ways to balance performance and resource consumption to ensure these models' accessibility and usefulness.
SliceGPT takes an important step in this direction by offering an innovative method to compress LLMs without sacrificing performance. The approach has the potential to make LLMs more accessible by reducing the significant compute and memory requirements they demand when deployed.
Future Prospects for Transformer Compression
The impressive results of SliceGPT showcase the potential of using orthogonal matrix transformations and computational invariance in transformer networks to shrink models without degrading their performance. The authors hope that their work in this area will inspire more research and innovations in reducing pre-trained models' memory and computation demands.
With advances like SliceGPT, we can expect the future of large language model and transformer compression to involve not only more refined techniques but also more widespread adoption and deployment of these powerful models in real-world applications. As a result, the true impact of computational invariance and transformer network compression may only just be beginning to take shape.
DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
Authors: Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y.K. Li, Fuli Luo, Yingfei Xiong, Wenfeng Liang
Source and references: https://arxiv.org/abs/2401.14196
Introduction
The landscape of software development is changing rapidly, thanks to the advancement of large language models (LLMs) that have brought about a new era of code intelligence. These models have the potential to automate and streamline many aspects of coding, such as bug detection and code generation, enhancing productivity and reducing human error.
However, one of the main challenges in this domain is the performance gap between open-source models and their closed-source (proprietary) counterparts. In response to this challenge, the authors introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B parameters. These models have been trained from scratch on a diverse corpus of 2 trillion tokens from 87 programming languages.