Contents
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
Harnessing Diversity for Important Data Selection in Pretraining Large Language Models
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
DreamWaltz-G: Expressive 3D Gaussian Avatars from Skeleton-Guided 2D Diffusion
Attention Prompting on Image for Large Vision-Language Models
Differential Privacy Regularization: Protecting Training Data Through Loss Function Regularization
Accumulator-Aware Post-Training Quantization
Efficient Feature Interactions with Transformers: Improving User Spending Propensity Predictions in Gaming
FineZip: Pushing the Limits of Large Language Models for Practical Lossless Text Compression
Programming Every Example: Lifting Pre-training Data Quality like Experts at Scale
Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia
Evolutionary Greedy Algorithm for Optimal Sensor Placement Problem in Urban Sewage Surveillance
PokeFlex: Towards a Real-World Dataset of Deformable Objects for Robotic Manipulation
Semantically-Driven Disambiguation for Human-Robot Interaction
VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
Authors: Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang
Source and references: https://arxiv.org/abs/2409.17066v1
Introduction
This paper introduces Vector Post-Training Quantization (VPTQ), a novel technique for extremely low-bit quantization of large language models (LLMs) to address the challenges of model size, storage, and inference efficiency.
Key Points
The authors use Second-Order Optimization to formulate the LLM vector quantization (VQ) problem and guide the quantization algorithm design.
They propose a Channel-Independent Second-Order Optimization method for a more granular VQ.
The authors introduce a brief and effective codebook initialization algorithm by decomposing the optimization problem.
The paper extends VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model.
The experimental results show that VPTQ outperforms state-of-the-art techniques in terms of model quantization perplexity and accuracy on various LLM benchmarks.
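The vector-quantization core in the points above can be sketched with a toy example: reshape the weight matrix into short vectors, fit a small codebook, and replace each vector by its nearest codebook entry, so only per-vector indices plus the shared codebook need to be stored. This is an illustrative simplification using plain k-means; the paper's actual algorithm is driven by second-order (Hessian) information and operates channel-independently, which this sketch omits.

```python
import numpy as np

def vector_quantize(W, vec_dim=2, k=64, iters=10, seed=0):
    """Toy vector quantization of a weight matrix (not VPTQ itself).

    Groups weights into vec_dim-sized vectors, fits a k-entry
    codebook with plain k-means, and replaces each vector by its
    nearest centroid.
    """
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, vec_dim)                         # (n, v)
    # Codebook initialization: sample k existing vectors as seeds.
    codebook = vecs[rng.choice(len(vecs), k, replace=False)].copy()
    for _ in range(iters):                                # k-means refinement
        d = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(1)
        for c in range(k):
            members = vecs[idx == c]
            if len(members):
                codebook[c] = members.mean(0)
    # Final assignment: each vector is stored as a log2(k)-bit index.
    d = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(1)
    W_hat = codebook[idx].reshape(W.shape)                # dequantized weights
    return W_hat, codebook, idx

W = np.random.default_rng(1).normal(size=(64, 64)).astype(np.float32)
W_hat, codebook, idx = vector_quantize(W, vec_dim=2, k=64)
# With k=64 and vec_dim=2, each pair of weights costs a 6-bit index,
# i.e. 3 bits per weight plus the shared codebook.
```

With `vec_dim=2` and `k=64`, this illustrates how extreme low-bit rates (here 3 bits per weight) follow directly from the index-plus-codebook representation.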
Methodology
The authors use Second-Order Optimization to formulate the LLM VQ problem and guide the quantization algorithm design. They further refine the weights using Channel-Independent Second-Order Optimization for a more granular VQ. The paper also introduces a brief and effective codebook initialization algorithm by decomposing the optimization problem. Additionally, the authors extend VPTQ to support residual and outlier quantization to enhance model accuracy and further compress the model.
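The residual-quantization extension mentioned above can be sketched as a second codebook fitted to the first stage's error, so the reconstruction is the sum of two codebook entries. Again, this is a hedged toy sketch with plain k-means standing in for the paper's second-order-guided algorithm; function names and parameters here are illustrative, not from the paper.

```python
import numpy as np

def kmeans_codebook(vecs, k, iters=10, seed=0):
    """Plain k-means codebook fit; stands in for VPTQ's initialization."""
    rng = np.random.default_rng(seed)
    cb = vecs[rng.choice(len(vecs), k, replace=False)].copy()
    for _ in range(iters):
        idx = ((vecs[:, None] - cb[None]) ** 2).sum(-1).argmin(1)
        for c in range(k):
            m = vecs[idx == c]
            if len(m):
                cb[c] = m.mean(0)
    return cb

def residual_vq(W, vec_dim=2, k1=64, k2=64):
    """Two-stage (residual) vector quantization sketch.

    Stage 1 quantizes the weight vectors; stage 2 quantizes the
    leftover error with a second codebook. VPTQ's residual
    quantization follows the same idea, but driven by
    second-order optimization rather than k-means.
    """
    vecs = W.reshape(-1, vec_dim)
    cb1 = kmeans_codebook(vecs, k1)
    idx1 = ((vecs[:, None] - cb1[None]) ** 2).sum(-1).argmin(1)
    resid = vecs - cb1[idx1]                      # stage-1 error
    cb2 = kmeans_codebook(resid, k2, seed=1)
    idx2 = ((resid[:, None] - cb2[None]) ** 2).sum(-1).argmin(1)
    W_hat = (cb1[idx1] + cb2[idx2]).reshape(W.shape)
    W_stage1 = cb1[idx1].reshape(W.shape)         # for comparison
    return W_hat, W_stage1

W = np.random.default_rng(2).normal(size=(64, 64)).astype(np.float32)
W_hat, W_stage1 = residual_vq(W)
# The two-stage reconstruction is closer to W than stage 1 alone,
# at the cost of storing a second index per vector.
```

The second stage spends extra bits only on the error that survives the first stage, which is why residual quantization improves accuracy while keeping the model strongly compressed.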
Results and Findings
The experimental results show that VPTQ reduces model quantization perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3 compared to the state-of-the-art at 2-bit. VPTQ also improves average accuracy on QA tasks by 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, and 11-22% on LLaMA-3. Additionally, VPTQ requires only 10.4-18.6% of the quantization algorithm execution time of the state-of-the-art and delivers a 1.6-1.8× increase in inference throughput.
Implications and Conclusions
The proposed VPTQ technique enables extremely low-bit quantization of LLMs, addressing the challenges of model size, storage, and inference efficiency. The significant improvements in model quantization perplexity, accuracy, and inference throughput demonstrate the effectiveness of VPTQ, making it a promising solution for the practical deployment of large language models.
Turn Every Application into an Agent: Towards Efficient Human-Agent-Computer Interaction with API-First LLM-Based Agents
Authors: Junting Lu, Zhiyang Zhang, Fangkai Yang, Jue Zhang, Lu Wang, Chao Du, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang
Source and references: https://arxiv.org/abs/2409.17140v1
Introduction
This paper proposes AXIS, a framework that enables efficient human-agent-computer interaction (HACI) by leveraging API-first LLM-based agents to replace traditional UI-based agents.