Nemotron-4 340B, Are We Done with MMLU?, SelfGoal, PowerInfer-2 & DepthAnything v2
Week 2, June 2024
Greetings,
Welcome to the latest edition of the State of AI newsletter. In this issue, we uncover the intricate workings and groundbreaking performance of the Nemotron-4 340B model family, a premier set of large language models by NVIDIA. We question the future of multitask understanding benchmarks with an in-depth analysis of MMLU results. We explore how language agents are evolving to achieve high-level goals effortlessly in SelfGoal. We also witness the remarkable speed-up in large language model inference on smartphones with PowerInfer-2 and delve into the sophisticated capabilities of Depth Anything V2. Each topic promises invaluable insights into the rapidly advancing AI landscape, providing you with an exciting and enlightening read. Enjoy!
Best regards,
Contents
Nemotron-4 340B Technical Report
Are We Done with MMLU?
SelfGoal: Your Language Agents Already Know How to Achieve High-level Goals
PowerInfer-2: Fast Large Language Model Inference on a Smartphone
Depth Anything V2
Unveiling the Power of Nemotron-4: NVIDIA’s 340B Model Family
Authors: NVIDIA Research Team
Source and references: https://arxiv.org/abs/2406.11704
A New Era in Large Language Models
Large language models (LLMs) have been making waves in the tech landscape for their incredible capabilities in natural language processing. NVIDIA's entry into this arena, the Nemotron-4 340B model family, promises to elevate these capabilities even further. This new family includes three intriguing models: Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Each of these models is designed to tackle various challenges in the AI domain, from data synthesis to model alignment. Let’s dive into what makes these models special and how they could reshape AI research and applications.
Competitive Performance and Accessibility
The Nemotron-4 family comes at a time when the community's focus is pivoting towards making LLMs more accurate and widely usable. NVIDIA's models are open access, distributed under the permissive NVIDIA Open Model License Agreement. This license permits distribution, modification, and various uses of the models and their outputs, making them a valuable resource for both researchers and commercial developers.
Remarkably, these models can be deployed on a single DGX H100 with 8 GPUs using FP8 precision, emphasizing their accessibility. But accessibility doesn't compromise performance. The Nemotron-4 models perform competitively on a range of evaluation benchmarks, standing strong against other open access models.
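For readers who want a feel for what single-node deployment could look like, here is a minimal serving sketch using vLLM's FP8 quantization and 8-way tensor parallelism. The Hugging Face model identifier and the quantization flag are assumptions for illustration; check NVIDIA's release materials for the officially supported deployment path.

```python
# Hedged sketch: serving a Nemotron-4-340B checkpoint on one 8-GPU DGX H100
# node with FP8 weights via vLLM. Model id and flags are assumptions, not
# NVIDIA's documented deployment recipe.
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Nemotron-4-340B-Instruct",  # hypothetical hub identifier
    tensor_parallel_size=8,                   # shard layers across the 8 H100s
    quantization="fp8",                       # FP8 weights to fit a single node
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain grouped query attention in two sentences."], params)
print(outputs[0].outputs[0].text)
```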
The Importance of Synthetic Data
A standout feature of the Nemotron-4 340B models is their effectiveness in generating synthetic data, a crucial ingredient for training smaller language models. Over 98% of the data used to align these models was synthetically generated, showcasing not only Nemotron-4's own capabilities but also synthetic data's growing role in reducing the cost and effort of human-annotated data collection.
To further support the community, NVIDIA is open-sourcing the synthetic data generation pipeline used in their model alignment process. This pipeline can significantly aid research and development by providing high-quality, automatically generated training data.
Breaking Down the Models: Base, Instruct, and Reward
Nemotron-4-340B-Base
The base model of the Nemotron-4 family, Nemotron-4-340B-Base, was trained on a staggering 9 trillion tokens sourced from a high-quality dataset. These tokens come from various sources, covering English natural language data, multilingual natural language data, and source code data. This diverse pretraining data ensures that the model is well-rounded and capable of performing tasks across different domains and languages.
Architecturally, Nemotron-4-340B-Base mirrors the Nemotron-4-15B-Base, featuring a standard decoder-only Transformer structure. Key elements include Rotary Position Embeddings, a SentencePiece tokenizer, and grouped query attention, which together enhance its capacity for understanding and generating human-like text.
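To make the grouped query attention idea concrete, here is a minimal PyTorch sketch. The head counts and dimensions are illustrative, and the causal mask and rotary position embeddings are omitted for brevity, so this is not the model's actual implementation.

```python
# Minimal grouped query attention (GQA) sketch: each group of query heads
# shares one key/value head, which shrinks the KV cache during inference.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim=1024, n_q_heads=16, n_kv_heads=4):
        super().__init__()
        self.n_q, self.n_kv, self.hd = n_q_heads, n_kv_heads, dim // n_q_heads
        self.wq = nn.Linear(dim, n_q_heads * self.hd, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.hd, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.hd, bias=False)
        self.wo = nn.Linear(n_q_heads * self.hd, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q, self.hd).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv, self.hd).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv, self.hd).transpose(1, 2)
        # Repeat K/V so every query head in a group attends to its shared K/V head.
        group = self.n_q // self.n_kv
        k, v = k.repeat_interleave(group, dim=1), v.repeat_interleave(group, dim=1)
        scores = (q @ k.transpose(-2, -1)) / self.hd ** 0.5  # causal mask omitted
        out = F.softmax(scores, dim=-1) @ v
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```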
Nemotron-4-340B-Instruct
Designed to excel in instruction-following tasks, Nemotron-4-340B-Instruct leverages the alignment process to better understand and follow human instructions. This process includes supervised fine-tuning (SFT) followed by preference fine-tuning methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). By training on high-quality synthetic prompts and dialogues, this model can engage in more coherent and contextually appropriate conversations, making it ideal for chatbots and automated support systems.
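As a rough illustration of the preference fine-tuning stage, the sketch below shows the standard DPO objective. The log-probabilities would come from the policy being trained and a frozen reference model; the beta value is a placeholder, not the hyperparameter used for Nemotron-4.

```python
# Hedged sketch of the Direct Preference Optimization (DPO) loss.
# logp_* are summed token log-probabilities of the chosen / rejected
# responses under the policy and under the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers the chosen response,
    # relative to the reference model's preference.
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```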
Nemotron-4-340B-Reward
Nemotron-4-340B-Reward takes things a step further by serving as a sophisticated reward model. It evaluates the quality of responses, a crucial function in RLHF and for quality filtering in synthetic data generation. The reward model is built on the base model, with a linear projection that maps the final hidden states into a five-dimensional vector of attributes such as helpfulness and coherence. This nuanced scoring helps it surpass even proprietary models on benchmarks like RewardBench, making it a vital tool for developing better instruction-following models.
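The sketch below shows how such an attribute-scoring head could look. The attribute names follow the HelpSteer-style scheme and the last-token pooling is an assumption for illustration, not a confirmed detail of NVIDIA's implementation.

```python
# Illustrative reward head: a linear projection from the last token's hidden
# state to a five-dimensional vector of attribute scores.
import torch
import torch.nn as nn

ATTRIBUTES = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]

class RewardHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, len(ATTRIBUTES))

    def forward(self, hidden_states: torch.Tensor) -> dict:
        # Pool the sequence with the final token's hidden state (an assumption).
        scores = self.proj(hidden_states[:, -1, :])
        return dict(zip(ATTRIBUTES, scores.unbind(dim=-1)))
```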
Technological Foundations and Training Mighty Giants
Training such massive models is no small feat. NVIDIA utilized 768 DGX H100 nodes, each equipped with 8 H100 GPUs. Within each node, the GPUs are linked with NVLink and NVSwitch, while Mellanox InfiniBand connects the nodes to ensure seamless communication and data flow. This infrastructure, combined with advanced parallelism techniques like tensor parallelism and pipeline parallelism, enabled efficient training of the models.
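As a back-of-the-envelope view, the snippet below shows how the cluster size breaks down and one way tensor, pipeline, and data parallelism could compose across it. The specific split is an illustrative assumption, not the configuration reported for Nemotron-4.

```python
# Cluster arithmetic plus one hypothetical parallelism layout.
nodes, gpus_per_node = 768, 8
world_size = nodes * gpus_per_node          # 6144 H100 GPUs in total

tensor_parallel = 8        # shard each layer's weights within a node (NVLink domain)
pipeline_parallel = 12     # split the layer stack across nodes (hypothetical)
data_parallel = world_size // (tensor_parallel * pipeline_parallel)  # 64 replicas

assert tensor_parallel * pipeline_parallel * data_parallel == world_size
print(f"{world_size=} {data_parallel=}")
```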
Continued training, involving a strategic shift in data distribution and learning-rate schedule, further refined the models. This meticulous process ensured that the models could transition seamlessly from pretraining to deployment while retaining high-quality performance.
Synthetic Data Generation: Beyond Human Limits
Generating high-quality synthetic data is at the core of Nemotron-4's alignment success. NVIDIA’s synthetic data generation pipeline covers all steps from prompt generation to quality filtering and preference ranking. This system includes synthetic prompt generation, specific instruction formats, and two-turn prompts to boost conversational abilities.
For instance, synthetic single-turn prompts are created by first generating a diverse set of macro topics and subtopics, ensuring broad coverage of scenarios. These prompts span tasks like open Q&A, writing, closed Q&A, and math & coding, catering to a wide array of applications and enhancing the model's versatility.
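A toy version of that single-turn step might look like the sketch below. The topic lists, template wording, and the generate() callable are hypothetical stand-ins, not the open-sourced pipeline itself.

```python
# Toy single-turn synthetic prompt generation: sample a macro topic and a
# subtopic, then ask a generator model to write a task-specific prompt.
import random

MACRO_TOPICS = {
    "machine learning": ["evaluation benchmarks", "synthetic data", "alignment"],
    "personal finance": ["budgeting", "retirement planning"],
}
TASKS = ["open Q&A", "writing", "closed Q&A", "math & coding"]

def make_synthetic_prompt(generate):
    """`generate` is any callable mapping an instruction string to model text."""
    macro = random.choice(list(MACRO_TOPICS))
    sub = random.choice(MACRO_TOPICS[macro])
    task = random.choice(TASKS)
    instruction = (
        f"Write one {task} prompt about '{sub}' (a subtopic of '{macro}'). "
        "Return only the prompt itself."
    )
    return generate(instruction)
```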
Achieving Benchmark Success
Performance metrics speak volumes about Nemotron-4's capabilities. The models have shown impressive results on reasoning benchmarks like ARC-Challenge, MMLU, and BIG-Bench Hard, competing head-to-head with other top models such as Llama-3 and Qwen-2. Nemotron-4-340B-Instruct shines in instruction following and chat capabilities, proving its mettle against other instruct models.
Moreover, Nemotron-4-340B-Reward stands out on RewardBench, particularly in challenging categories like Chat-Hard, outperforming even established models like GPT-4 and Gemini 1.5 Pro. This illustrates the reward model's exceptional ability to gauge the quality and helpfulness of responses accurately.
Reinforcing Responsible AI Development
While the technical prowess of the Nemotron-4 family is undeniably impressive, NVIDIA also emphasizes responsible AI development. By releasing these models and their training and alignment codes, NVIDIA is promoting transparency and reproducibility within the community. The models and synthetic data pipeline are intended to support innovative as well as ethical AI applications, ensuring they are not used for generating harmful or toxic content.
Closing Thoughts: A Leap Forward in AI
The release of the Nemotron-4 340B model family marks a significant milestone in large language model development. By offering open access models that are both powerful and versatile, NVIDIA is empowering the AI community to push boundaries further. Whether it's generating synthetic data for training smaller models, developing sophisticated instruction-following systems, or creating high-quality reward models, Nemotron-4 aims to be a cornerstone in the future of AI research and development.
Through this approach, NVIDIA not only contributes valuable tools and methodologies but also reinforces the importance of open science and ethical AI practices. By harnessing the capabilities of Nemotron-4, researchers and developers can explore novel applications, drive advancements, and build AI systems that are more robust, reliable, and responsible. So, keep an eye on Nemotron-4 — it's setting the stage for the next big leap in AI.
Are We Done with MMLU?
Authors: Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini
Source and references: https://arxiv.org/abs/2406.04127
Introduction
The journey of evaluating the performance of large language models (LLMs) is riddled with obstacles, many of which stem from the datasets we rely on for benchmarking. One such prominent dataset is the Massive Multitask Language Understanding (MMLU) benchmark. Despite its popularity, recent research by Aryo Pradipta Gema, Pasquale Minervini, and colleagues from the University of Edinburgh and other institutions highlights significant errors within the MMLU dataset. Their study brings to light the critical need to revisit and refine our benchmarking tools. The paper not only identifies these issues but also introduces MMLU-Redux, a more accurate subset of the original benchmark, and explores methods to improve dataset reliability.