Greetings,
Welcome to the 52nd edition of the State of AI. In this issue, we explore a groundbreaking approach to vision transformers, the first open-source multilingual language model red-teamed for safety, a powerful hybrid transformer-mamba language model, the potential for deploying large language models on mobile GPUs, and the innovative stepwise DPO training approach for aligning large language models.
Dive into this issue to uncover the exciting developments in AI, showcasing the potential for transformative applications and pushing the boundaries of technology. Enjoy!
Best regards,
Contents
ViTAR: Vision Transformer with Any Resolution
Aurora-M: Red-teaming the First Open Source Multilingual Language Model
Jamba: Unveiling the Hybrid Transformer-Mamba Language Model
Transformer-Lite: Bringing Large Language Models to Mobile GPUs
Stepwise DPO: A New Frontier in Large Language Model Alignment
ViTAR: Vision Transformer with Any Resolution
Authors: Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
Source and references: https://arxiv.org/abs/2403.18361
Introduction
In a recent paper, a group of researchers has tackled a significant challenge faced by Vision Transformers (ViTs): their constrained scalability across different image resolutions. Typically, ViTs experience a performance decline when processing resolutions different from those seen during training. The authors introduce two key innovations to address this issue: Adaptive Token Merger (ATM) and Fuzzy Positional Encoding (FPE). Their resulting model, ViTAR (Vision Transformer with Any Resolution), demonstrates impressive adaptability, achieving 83.3% top-1 accuracy at a 1120x1120 resolution and 80.4% accuracy at a 4032x4032 resolution, all while reducing computational costs.
Adaptive Token Merger (ATM)
The ATM module is designed to effectively handle multiple resolutions in a single model. The authors designed ATM to adaptively merge input tokens and progressively reduce their number based on specific grid shapes until a fixed number of tokens is reached. This approach significantly enhances the model's resolution adaptability and reduces the computational burden when processing high-resolution images.
ATM partitions the input tokens into units and progressively merges the tokens within each unit, mapping all tokens onto a grid of fixed shape. The resulting "grid tokens" then undergo feature extraction through a sequence of Multi-Head Self-Attention modules.
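As a concrete illustration, here is a minimal PyTorch sketch of one grid-merging step in the spirit of ATM; the 2x2 cell size, the use of the cell mean as the attention query, and the module layout are our simplifications, not the authors' released code.

```python
import torch
from torch import nn

class GridAttentionStep(nn.Module):
    """One simplified token-merging step in the spirit of ViTAR's ATM (a sketch, not the
    authors' code). Tokens on an (H, W) grid are grouped into 2x2 cells; the mean token of
    each cell queries the cell's tokens via cross-attention, yielding one merged token per
    cell and halving each spatial dimension. Assumes H and W are even."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int):
        b, _, c = x.shape                                    # x: (B, H*W, C)
        cells = x.view(b, h // 2, 2, w // 2, 2, c).permute(0, 1, 3, 2, 4, 5)
        cells = cells.reshape(-1, 4, c)                      # (B * num_cells, 4, C)
        query = cells.mean(dim=1, keepdim=True)              # cell mean acts as the query
        merged, _ = self.attn(query, cells, cells)           # cross-attention inside the cell
        return merged.reshape(b, (h // 2) * (w // 2), c), h // 2, w // 2

# Applying such steps repeatedly maps any input resolution onto a fixed grid (e.g. 14x14),
# after which the standard ViT blocks always see the same number of tokens.
```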
Fuzzy Positional Encoding (FPE)
Positional encoding plays a crucial role in ViTs. However, the commonly used learnable and sin-cos positional encodings offer limited resolution robustness. Convolution-based positional encoding is more robust to resolution changes, but its reliance on adjacent tokens makes it incompatible with self-supervised learning frameworks such as Masked AutoEncoder (MAE), where many neighboring tokens are masked out.
FPE addresses this issue by providing the ViT with fuzzy positional information. During the training phase, FPE changes the positional coordinates provided to the model within a certain range, preventing the model from overfitting to position at specific resolutions. This enhances the model's resolution adaptability and helps it to generalize better when faced with different input resolutions.
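A rough sketch of how such positional fuzzing can be implemented is shown below; the reference grid size, the uniform noise range, and the bilinear sampling against a learnable table are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F
from torch import nn

class FuzzyPositionalEncoding(nn.Module):
    """Sketch of fuzzy positional encoding in the spirit of ViTAR's FPE (details assumed).
    A learnable table of embeddings lives on a fixed reference grid; during training each
    token's (x, y) coordinate is jittered by uniform noise before the embedding is sampled
    with bilinear interpolation, so the model never overfits to exact positions."""

    def __init__(self, dim: int, grid: int = 14):
        super().__init__()
        self.table = nn.Parameter(torch.zeros(1, dim, grid, grid))
        nn.init.trunc_normal_(self.table, std=0.02)

    def forward(self, h: int, w: int) -> torch.Tensor:
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=torch.float32),
            torch.arange(w, dtype=torch.float32),
            indexing="ij",
        )
        coords = torch.stack([xs, ys], dim=-1)                    # (h, w, 2) in token units
        if self.training:
            coords = coords + torch.rand_like(coords) - 0.5       # fuzz positions by U(-0.5, 0.5)
        coords[..., 0] = coords[..., 0] / max(w - 1, 1) * 2 - 1   # normalize to [-1, 1]
        coords[..., 1] = coords[..., 1] / max(h - 1, 1) * 2 - 1
        pe = F.grid_sample(self.table, coords.unsqueeze(0), align_corners=True)
        return pe.flatten(2).transpose(1, 2)                      # (1, h*w, dim), added to tokens
```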
Multi-resolution Training
Similar to ResFormer, the authors also employ a multi-resolution training approach to train ViTAR, covering a wider spectrum of resolutions during training. This strategy enables the model to adapt to an extensive range of resolutions, achieving favorable results in image classification tasks.
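In practice this can be as simple as resampling each batch to a randomly chosen resolution; the snippet below illustrates the idea with an assumed set of resolutions, not the paper's exact training schedule.

```python
import random
import torch.nn.functional as F

# Minimal sketch of multi-resolution training: each batch is resized to a randomly chosen
# resolution so ATM and FPE see many different grid sizes during training.
RESOLUTIONS = [224, 448, 672, 896, 1120]   # illustrative list, not the paper's exact schedule

def to_random_resolution(images):          # images: (B, 3, H, W)
    side = random.choice(RESOLUTIONS)
    return F.interpolate(images, size=(side, side), mode="bilinear", align_corners=False)
```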
Experimental Results
The researchers conducted experiments on various vision tasks to validate the effectiveness of the proposed method. ViTAR outperformed other methods in the task of image classification, demonstrating impressive resolution adaptability and lower computational complexity. On tasks that require high-resolution inputs (instance segmentation, semantic segmentation), the model achieved results similar to ResFormer and DeiT with 50% of the FLOPs.
ViTAR also showed compatibility with self-supervised learning frameworks like MAE. This compatibility enables the model to be trained on large-scale unlabeled datasets, which could further improve its performance and applicability to real-world scenarios.
Impact and Future Research
ViTAR offers a cost-effective solution for enhancing the resolution scalability of ViTs, paving the way for more versatile and efficient high-resolution image processing. Its remarkable adaptability and compatibility with self-supervised learning frameworks make it an attractive choice for scenarios where adaptability and computational efficiency are crucial.
Future research could explore further improvements in the model's resolution adaptability or its compatibility with other self-supervised learning frameworks. Additionally, exploring the application of ViTAR to other vision tasks and even extending the model to non-vision tasks (e.g., natural language processing) could unveil new possibilities and deepen our understanding of the adaptability and scalability of Transformer-based models.
Conclusion
The authors presented ViTAR, a ViT with Adaptive Token Merger and Fuzzy Positional Encoding, which allows for better resolution adaptability and computational efficiency. These innovations successfully address the challenge of constrained scalability across different image resolutions in standard ViTs. ViTAR's impressive adaptability and compatibility with self-supervised learning frameworks make it a promising choice for future research and real-world applications in the field of computer vision.
AURORA-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order
Authors: Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T Stillerman, Felix Friedrich, Prateek Yadav, Tanmay Laud, Vu Minh Chien, Terry Yue Zhuo, Diganta Misra, Ben Bogin, Xuan-Son Vu, Marzena Karpinska, Arnav Varma Dantuluri, Wojciech Kusa, Tommaso Furlanello, Rio Yokota, Niklas Muennighoff, Suhas Pai, Tosin Adewumi, Veronika Laippala, Xiaozhe Yao, Adalberto Junior, Alpay Ariyak, Aleksandr Drozd, Jordan Clive, Kshitij Gupta, Liangyu Chen, Qi Sun, Ken Tsui, Noah Persaud, Nour Fahmy, Tianlong Chen, Mohit Bansal, Nicolò Monti, Tai Dang, Ziyang Luo, Tien-Tung Bui, Roberto Navigli, Virendra Mehta, Matthew Blumberg, Victor May, Huu Nguyen, Sampo Pyysalo
Source and references: https://arxiv.org/abs/2404.00399v1
The Challenge of Open Source Multilingual Language Models
Large Language Models (LLMs) drive many AI applications, such as machine translation, text summarization, and code generation. The problem is that training and serving LLMs involves heavy computational costs, which limits accessibility. Open-source projects like BLOOM, StarCoder, and OLMo have tried to democratize access to LLMs, but they still struggle with non-English text, and training them from scratch remains expensive.
AURORA-M: A New Multilingual Language Model
AURORA-M is a 15-billion-parameter multilingual open-source LLM designed to address these challenges. It is trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from the StarCoderPlus model on an additional 435 billion tokens, AURORA-M's total training exceeds 2 trillion tokens. This extensive pretraining enables the model to handle multiple languages and code effectively.
Two-stage Training Curriculum for AURORA-M
AURORA-M uses a two-stage curriculum for continual pretraining: Continual Auxiliary Pretraining (CAP) and Continual Alignment Tuning (CAT). CAP focuses on exposing the model to diverse multilingual web data for a solid foundation. CAT is a strategic data-mixing approach to enhance the model's performance in targeted areas and align with predefined objectives. Both stages incorporate instruction tuning datasets.
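As a rough illustration of how such a curriculum can be organized, here is a sketch in which the source names, mixture weights, and helpers (sample_batch, train_step, batch.num_tokens) are hypothetical placeholders; only the per-stage token budgets (377B for CAP, 58B for CAT) follow the figures reported in the paper.

```python
# Illustrative two-stage continual-pretraining curriculum (sources and weights are placeholders).
CAP_MIXTURE = {"multilingual_web": 0.80, "code": 0.15, "instructions": 0.05}
CAT_MIXTURE = {"curated_multilingual": 0.60, "code": 0.20, "instructions": 0.20}

def continual_pretrain(model, sample_batch, train_step):
    for mixture, token_budget in [(CAP_MIXTURE, 377e9), (CAT_MIXTURE, 58e9)]:
        seen = 0
        while seen < token_budget:
            batch = sample_batch(mixture)   # draw sources according to the stage's mixture
            train_step(model, batch)
            seen += batch.num_tokens        # placeholder for the number of tokens in the batch
```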
Comprehensive Data Curation and Filtering
Data used for AURORA-M's training comes from sources such as the Stack, RefinedWeb, RedPajama, a subset of the Pile, HPLT, and MC4. The CAP stage includes 377 billion tokens of web data, while CAT uses 58 billion tokens. Multiple data filters are applied to remove toxic content and low-quality text.
Adherence to Safety Guidelines and AI Development Laws
A key aspect of AURORA-M is its commitment to safety, specifically aligning with the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. AURORA-M is the first open-source multilingual model fine-tuned on a comprehensive collection of human-reviewed safety instructions.
The Biden-Harris Redteam Dataset for Safety Evaluation
To evaluate AURORA-M's safety, the Biden-Harris Redteam Dataset was curated with 5,000 red-teaming instructions and responses focused on harm, cyber-attacks, illegal acts, dual usage technologies, privacy, and circumventing controls. AURORA-M is then evaluated for safety performance using this dataset and other safety evaluation benchmarks.
Rigorous Evaluation Showcasing Robustness of AURORA-M
AURORA-M is evaluated across various tasks and languages to demonstrate its ability to retain knowledge while acquiring new capabilities through continual pretraining. It showcases competitive performance in English and coding while excelling in multilingual settings. Safety evaluations display AURORA-M's strong alignment with responsible AI development practices.
English Evaluation Datasets and Results
English evaluation measures AURORA-M's performance on question-answering tasks like OpenBookQA and TriviaQA, natural language inference using HellaSwag, reading comprehension with SQuAD2.0 and XWINO, and arithmetic reasoning with GSM8K. For these tasks, 8-shot inference is used, showing competitive performance.
Japanese, Finnish, Hindi, and Vietnamese Evaluation Datasets and Results
For non-English evaluations, various benchmarks like llm-jp-eval, FIN-bench, and mlmm are used. These datasets cover tasks such as multiple-choice question answering, free-form question answering, machine reading comprehension, automatic summarization, and machine translation. Results highlight AURORA-M's exceptional performance in multilingual settings.
Conclusion and Takeaways
AURORA-M is a pioneering open-source multilingual LLM that outperforms alternatives in multilingual and safety evaluations. By continually pretraining on a diverse dataset and aligning with safety guidelines, AURORA-M has the potential to democratize access to LLMs while promoting responsible development. AURORA-M and its variants are released for the community to use and build upon, fostering further innovation in the open-source LLM development landscape.
Jamba: A Hybrid Transformer-Mamba Language Model
Authors: Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, Yoav Shoham
Source and references: https://arxiv.org/abs/2403.19887
Meet Jamba, the novel hybrid language model
Attention, machine learning enthusiasts! There's a new player in town: Jamba, a large language model that combines Transformer and Mamba layers with a mixture-of-experts (MoE) architecture. The model fits in a single 80GB GPU while providing high throughput and a small memory footprint.
Jamba's flexible architecture for specific configurations
Jamba's hybrid architecture is designed to combine the best of both worlds: Transformers and Mamba. Transformers are the dominant architecture but suffer from high memory and compute requirements at long context lengths, largely because of their growing key-value (KV) cache. Mamba, a recent state-space model, is more efficient to train and run and scales more gracefully to long contexts, but on its own it does not match the performance of comparably sized Transformer language models.
To exploit the benefits of both, Jamba combines Transformer and Mamba layers with a flexible architecture that allows the user to adjust aspects such as the number of layers, the ratio of attention-to-Mamba layers, the number of experts, and the number of top experts used at each token.
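As a sketch of what these knobs look like in practice, the configuration below uses the values reported in the paper for the released model (8 layers per block, a 1:7 attention-to-Mamba ratio, MoE every other layer, 16 experts with top-2 routing); the exact placement of the attention layer inside each block is our assumption for illustration.

```python
from dataclasses import dataclass

@dataclass
class JambaBlockConfig:
    """Configuration knobs of the hybrid block (a sketch). Defaults follow the values the
    paper reports for the released model; layer placement within the block is assumed."""
    layers_per_block: int = 8            # l: layers in one Jamba block
    attn_to_mamba_ratio: tuple = (1, 7)  # a:m, one attention layer per seven Mamba layers
    moe_every: int = 2                   # e: use MoE in place of the dense MLP every e layers
    num_experts: int = 16                # n: experts per MoE layer
    top_k_experts: int = 2               # K: experts routed to each token

def layer_plan(cfg: JambaBlockConfig):
    """Per-layer plan for one block: which token mixer and which MLP variant each layer uses."""
    a, m = cfg.attn_to_mamba_ratio
    period = a + m
    return [
        {"mixer": "attention" if (i + 1) % period == 0 else "mamba",
         "mlp": "moe" if (i + 1) % cfg.moe_every == 0 else "dense"}
        for i in range(cfg.layers_per_block)
    ]

print(layer_plan(JambaBlockConfig()))  # 8 layers: 7 Mamba + 1 attention, MoE on alternating layers
```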
Jamba: A strong contender in language modeling
When compared to other publicly available models on standard language model benchmarks and long-context evaluations, Jamba demonstrates state-of-the-art performance. It supports a context length of up to 256K tokens, outperforming other models such as Mixtral and Llama-2.
How Jamba fits in a single 80GB GPU
The implementation is designed to make the most of a single 80GB GPU. It features a series of Jamba blocks, each with a configurable number of layers, a ratio of attention-to-Mamba layers, MoE layers, a total number of experts per layer, and the number of top experts used at each token.
Jamba manages to maintain a small KV cache memory, even when working with 256K token contexts. This results in a reduced memory footprint and better throughput compared to other models with similar numbers of parameters.
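A back-of-the-envelope calculation shows why caching keys and values for only a fraction of the layers matters so much at 256K tokens; the layer count, head count, and head dimension below are illustrative assumptions rather than Jamba's exact sizes.

```python
def kv_cache_gib(n_layers, attn_fraction, n_kv_heads, head_dim, seq_len,
                 bytes_per_elem=2, batch=1):
    """Approximate KV-cache size in GiB: keys and values are stored only for attention layers."""
    n_attn_layers = int(n_layers * attn_fraction)
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch / 2**30

# All-attention stack vs. a 1:7 attention-to-Mamba hybrid, both at a 256K-token context
# (illustrative model sizes, not Jamba's exact configuration):
print(kv_cache_gib(32, 1.0, 8, 128, 256_000))    # ~31 GiB
print(kv_cache_gib(32, 1/8, 8, 128, 256_000))    # ~4 GiB
```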
The remarkable efficiency of Jamba
One of Jamba's standout features is its high throughput, particularly for long contexts. It processes large data batches swiftly, achieving 3 times the throughput of the Mixtral model for long contexts, while fitting in a single GPU. Jamba's high throughput is maintained as the context length increases, making it particularly suited for working with longer sequences in real applications.
Powerful training infrastructure and dataset
Jamba benefits from an in-house proprietary framework that allows efficient large-scale training. It was trained on NVIDIA H100 GPUs using techniques like FSDP, tensor parallelism, sequence parallelism, and expert parallelism.
The model is trained on a proprietary dataset that consists of text data from the Web, books, and code. This dataset is continually updated, with the latest update made in March 2024.
Impressive results on academic benchmarks
Jamba has proven its mettle with impressive results on a wide range of standard academic benchmarks, such as HellaSwag, WinoGrande, ARC-E, PIQA, BoolQ, GSM8K, and MMLU, to name a few. In most of these tests, Jamba performs comparably to or even better than other publicly available models.
An invitation to further explore the novel Jamba architecture
The Jamba language model is a powerful tool with impressive performance and throughput, especially when handling long-context evaluations. The authors encourage further exploration of this novel architecture by making the weights of the Jamba implementation publicly available under a permissive license.
The flexibility of Jamba's architecture allows for the possibility of improvements and optimizations made by the machine learning community. Jamba paves the way for future advancements in language models by offering a flexible and efficient alternative to traditional models.
With Jamba, the machine learning community has a new and promising model to build upon and explore, as it holds great potential for further optimization and adaptation. Its versatile architecture, high throughput, and relatively low memory requirements make it a top contender in the world of language models. Watch out - Jamba is here to shake things up!
Transformer-Lite: High-Efficiency Deployment of Large Language Models on Mobile Phone GPUs
Authors: Luchang Li, Sheng Qian, Jie Lu, Lunxi Yuan, Rui Wang, Qin Xie
Source and references: https://arxiv.org/abs/2403.20041
The Importance of On-Device Deployment for Large Language Models
Large language models (LLMs) are well known for revolutionizing various applications, such as intelligent assistants, text summarization, translation, and multi-modal tasks on mobile devices. However, deploying these LLMs on mobile devices remains a challenge due to their limited hardware performance, memory, and storage.
Most current applications rely on cloud-based deployment to work around these limitations. However, cloud deployment brings high serving costs and limits what certain applications can offer. To address this, the researchers set out to deploy LLMs efficiently on the device itself, which can reduce costs and improve the user experience.
Four Optimization Techniques for Efficient LLM Deployment on Mobile GPUs
To achieve high-efficiency LLM deployment on mobile GPUs, the authors propose four optimization techniques:
Symbolic expression-based approach to support dynamic shape model inference: This allows for better memory reuse, execution scheduling, and reduced time consumption during shape updating.
Operator optimizations and execution priority setting: These enhancements improve performance and reduce phone lagging during LLM inference. They also include operator fusions and dedicated matrix multiplication implementations for the prefill and decoding stages.
M0E4 FP4 quantization: This method minimizes the performance overhead in dequantization, enabling more efficient matrix multiplications.
Sub-tensor based technique: This eliminates the need for copying KV cache after LLM inference, improving efficiency.
These optimization techniques are implemented in the authors' mobile inference engine, Transformer-Lite, which is compatible with both Qualcomm and MTK processors.
Symbolic Expression-Based Dynamic Shape Inference
Dynamic shape tensors present a unique challenge, as their shape relationship is not easily discernible. To address this issue, the authors introduce a symbolic expression-based approach to express and infer the dynamic shape of tensors. The symbolic expressions enable accurate resolution of the shape relationship among tensors, crucial for memory reuse and performance optimization.
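As a minimal illustration of the idea (not the Transformer-Lite implementation), dynamic dimensions can be represented as symbols so that every intermediate shape is derived once and then concretized per request; the tensor names and model dimensions below are assumptions.

```python
import sympy as sp

# Dynamic dimensions are symbols; shapes of downstream tensors are expressed in terms of them.
seq = sp.Symbol("seq_len", positive=True, integer=True)
hidden, heads = 4096, 32                          # assumed model dimensions, for illustration

shapes = {
    "input_ids":   (1, seq),
    "hidden":      (1, seq, hidden),
    "attn_scores": (1, heads, seq, seq),          # derived from the shapes of its operands
    "kv_cache":    (1, heads, seq, hidden // heads),
}

def concretize(symbolic_shapes, seq_len):
    """Substitute the runtime sequence length into every symbolic shape."""
    return {name: tuple(int(d.subs(seq, seq_len)) if isinstance(d, sp.Expr) else d for d in dims)
            for name, dims in symbolic_shapes.items()}

print(concretize(shapes, 128))                    # e.g. a prefill request with 128 tokens
```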
To facilitate memory reuse, the authors employ the OpenCL buffer memory type and use the image from buffer extension to generate an image reference from the buffer. This technique eliminates the need for converting data from different memory types or allocating new memory.
Furthermore, to reduce shape updating time consumption during LLM inference, the authors use the attention mask mechanism to pad the model input sequence length to multiples of 64 or 128 during the decoding stage. This method reduces the need for time-consuming shape derivation in the model inference stage.
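The padding itself is simple rounding, sketched below; only the multiple (64 or 128) comes from the paper.

```python
import math

def padded_length(n_tokens: int, multiple: int = 64) -> int:
    """Round the sequence length up to a multiple of 64 (or 128); the extra positions are
    masked out via the attention mask, so tensor shapes change far less often."""
    return math.ceil(n_tokens / multiple) * multiple

print(padded_length(1), padded_length(65), padded_length(128))   # -> 64, 128, 128
```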
Operator Optimizations and Lagging Reduction
Matrix multiplication is the most time-consuming operator in LLM inference. However, other operators also contribute to a considerable overhead. To mitigate this, the authors conduct intensive operator fusions to improve LLM inference efficiency.
The authors also address the issue of phone lagging during LLM inference. By setting the execution priority of the model's operators to the lowest level using OpenCL extensions provided by Qualcomm and ARM, they effectively reduce lagging while the model runs.
M0E4 FP4 Quantization for Enhanced Performance
To address the performance overhead in the dequantization process, the authors propose an FP4 quantization method called M0E4. This method allows for efficient conversion of 4-bit quantized data to floating-point numbers with only two bitwise operations, significantly reducing performance overhead and seamlessly integrating with GPTQ and AWQ quantization methods.
Evaluating Transformer-Lite with Various LLMs
The authors evaluated Transformer-Lite's performance using LLMs with varied architectures and parameter counts ranging from 2B to 14B. They achieved prefill and decoding speeds of 121 tokens/s and 14 tokens/s for ChatGLM 6B, and 330 tokens/s and 30 tokens/s for the smaller Gemma 2B, respectively. Compared with CPU-based FastLLM and GPU-based MLC-LLM, the Transformer-Lite engine attains over a 10x speedup in prefill speed and a 2-3x speedup in decoding speed.
Overall, the Transformer-Lite engine, along with the proposed optimization techniques, enables efficient deployment of large language models on mobile GPUs, improving the overall performance and user experience. Moreover, it is compatible with both Qualcomm and MTK processors, making it a practical solution for deploying LLMs on a wide range of mobile devices.
Don't Use Your Data All at Once: Introducing Stepwise DPO
Authors: Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, Chanjun Park
Source and references: https://arxiv.org/abs/2403.19270
A New Approach to Tuning Large Language Models
As large language models (LLMs) continue to revolutionize natural language processing (NLP), aligning them with human preferences has become increasingly important. Current strategies like reinforcement learning and direct preference optimization (DPO) simplify the LLM training process. However, they still face challenges, especially when using proprietary models like GPT-4 as reference models. Researchers have now proposed a more practical solution—stepwise DPO (sDPO).
The Problem with Reference Models
Typically, DPO compares the log probabilities of chosen and rejected responses under the policy and a reference model, with human or strong AI judgment used to curate the preference datasets. But proprietary models like GPT-4 cannot serve as the reference model, since their APIs do not expose the needed log probabilities. In most cases, the reference model is therefore set to the base Supervised Fine-Tuning (SFT) model, which is weaker and potentially misaligned.
A more aligned reference model should yield better alignment tuning. However, an open-source model that has already undergone alignment tuning may not exist for the setting at hand, and adopting an external model as the reference raises safety concerns, since there is no control over how it was aligned. To address this, the researchers devised the sDPO approach.
Introducing Stepwise DPO (sDPO)
The core idea behind sDPO is to divide available preference datasets and use them in a stepwise manner, rather than all at once. With this method, the aligned model from a previous step is used as the reference model for the next step. Consequently, a more aligned reference model—or better lower bound—is employed, which results in better alignment tuning.
The primary benefit of sDPO is that it produces a more performant final aligned model than other popular LLMs. Furthermore, the approach can be easily applied to any preference data, making it extremely versatile and complementary to existing methods.
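As a sketch of the training loop (function names and the chunking helper are placeholders, not the authors' code), the preference data is split into chunks and, as we read the paper, the model aligned at step t serves as both the starting point and the reference model for step t+1:

```python
def sdpo(sft_model, preference_data, num_steps, dpo_train):
    """Stepwise DPO sketch: each step's aligned model becomes the next step's reference."""
    chunks = split_into_chunks(preference_data, num_steps)
    reference = sft_model                    # step 1: the SFT base model is the reference
    target = sft_model
    for chunk in chunks:
        target = dpo_train(policy=target, reference=reference, data=chunk)
        reference = target                   # the freshly aligned model becomes the next reference
    return target

def split_into_chunks(data, n):
    size = len(data) // n
    return [data[i * size:(i + 1) * size] for i in range(n - 1)] + [data[(n - 1) * size:]]
```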
Investigating the Importance of Reference Models
Preliminary experiments were conducted using datasets like Ultrafeedback and OpenOrca on LLMs such as Mistral-7B-OpenOrca and OpenHermes-2.5-Mistral-7B. Comparing different reference models—such as the SFT base model, SOLAR-0-70B, and Intel-7B-DPO—highlighted the significant role of pre-aligned models in tuning the final aligned model's performance.
However, simply adopting open-source pre-aligned models as reference models may not be feasible or safe. To address this, the researchers propose the sDPO method, in which each step uses a progressively more aligned reference model to improve the target model.
How sDPO Works
In sDPO, reference models are more aligned as they progress, resulting in a more strict lower bound on log probabilities. This approach induces curriculum learning, where the target model is first optimized to adhere to easy tasks and then gradually moves on to more challenging ones.
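To make the curriculum intuition concrete, recall the standard DPO objective (notation as in the original DPO formulation):

$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Rearranging the argument of $\sigma$, the loss pushes the target model's log-ratio $\log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}$ to exceed the reference model's ratio $\log \frac{\pi_{\mathrm{ref}}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}$. In sDPO, each step's reference is the more aligned model from the previous step, so this lower bound rises from step to step, which is the curriculum effect described above.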
To evaluate the effectiveness of sDPO, experiments were conducted on SOLAR 10.7B models using the OpenOrca and Ultrafeedback Cleaned preference datasets. Employing sDPO resulted in a higher H4 score (the average across ARC, HellaSwag, MMLU, and TruthfulQA) than conventional DPO. Moreover, the specific way the available DPO data is split into multiple datasets has a significant impact on performance.
Analyzing the Effectiveness and Limitations of sDPO
The reference models in sDPO demonstrate increased alignment, which leads to a more performant aligned model. However, adopting open-source models as reference models can be dangerous due to possible overlaps in training datasets. Researchers recommend sDPO as an alternative to ensure an unbiased and safe training process.
The sDPO method undoubtedly offers promise, but it also has some limitations. For instance, the optimal strategy for segmenting more complex DPO datasets remains unclear. Furthermore, expanding the experimental framework to include a wider range of LLMs could provide a broader understanding of the strengths and limitations of sDPO.
Looking Ahead
Stepwise DPO presents a practical, efficient, and impactful alternative to LLM alignment tuning, in contrast to using conventional DPO techniques. By using preference data in a stepwise manner, models become more performant and significantly closer to human preferences. Further exploration of this method is necessary, especially in segmenting complex DPO datasets and evaluating LLMs across various tasks. However, sDPO has already proven its potential as a valuable addition to the language model training toolkit, garnering attention from the AI community and offering promise for future developments.