Greetings,
Welcome to the 48th edition of the State of AI. This time, we explore StarCoder 2 and The Stack v2, ushering in the next generation of technology. Dive into byte models that simulate digital worlds beyond language limitations, embrace the era of 1-bit LLMs, and discover EMO - generating expressive portrait videos from audio. Finally, join us as we review Sora, examining large vision models' background, technology, limitations, and opportunities. Prepare for a captivating journey through the cutting edge of AI research.
Best Regards,
StarCoder 2 and The Stack v2: The Next Generation
Beyond Language Models: Byte Models are Digital World Simulators
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
StarCoder2 and The Stack v2: The Next Generation
Authors: Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries
Source and references: https://arxiv.org/abs/2402.19173v1
The Next Chapter in Code LLMs
As the importance of Large Language Models for Code (Code LLMs) grows, researchers are continually seeking to improve their performance and capabilities. The BigCode project, an open-scientific collaboration, introduces an upgraded version of StarCoder, called StarCoder2. The project also unveils The Stack v2, built on the extensive Software Heritage archive, alongside carefully selected high-quality data sources like GitHub pull requests, Kaggle notebooks, and code documentation.
What makes this project so promising is how it trains and evaluates Code LLMs with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens. As a result, the new StarCoder2 models significantly outperform their predecessors and compete with the other leading models in most benchmarks.
What's New in StarCoder2 and The Stack v2
StarCoder2 and The Stack v2 deliver improvements at multiple levels, from data collection and training to performance benchmarking.
Getting More from the Data
The Stack v2 digs deep into Software Heritage's extensive archive, covering over 600 programming languages, along with additional sources like GitHub issues, pull requests, Jupyter notebooks, and code documentation. Curating the data involved deduplication, removal of low-quality code, redaction of Personally Identifiable Information (PII), and filtering of malicious code.
The result is a training set with more than 900 billion unique tokens, four times larger than the original StarCoder dataset.
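To make those curation steps concrete, here is a minimal, hypothetical sketch of that kind of pipeline in Python. The heuristics and the `curate` helper are illustrative stand-ins, not the actual filters used to build The Stack v2.

```python
import hashlib
import re

# Hypothetical sketch of the kind of curation pipeline described above; the
# real Stack v2 pipeline (near-deduplication, license and quality filters,
# PII-redaction models, malware detection) is far more involved.

def is_low_quality(source: str) -> bool:
    """Crude stand-in for the real quality filters (e.g. minified or trivial files)."""
    lines = source.splitlines() or [""]
    avg_line_len = sum(len(line) for line in lines) / len(lines)
    return avg_line_len > 200 or len(lines) < 3

def redact_pii(source: str) -> str:
    """Placeholder for PII redaction; only masks obvious e-mail addresses."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "<EMAIL>", source)

def curate(files: list[str]) -> list[str]:
    seen_hashes: set[str] = set()
    curated = []
    for src in files:
        digest = hashlib.sha256(src.encode("utf-8")).hexdigest()
        if digest in seen_hashes:        # exact deduplication
            continue
        seen_hashes.add(digest)
        if is_low_quality(src):          # drop low-quality code
            continue
        curated.append(redact_pii(src))  # redact PII before keeping the file
    return curated
```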
Data Governance & Transparency
One of the primary goals of the BigCode project is promoting openness and transparency. This approach ensures the community can readily reuse each other's work, check for biases, and gain a better understanding of the data used to train these models. The Stack v2 offers a governance tool for developers to check if their source code is included in the dataset and allows them to opt-out if desired.
Improved Training Process
To train the next-generation models, the BigCode project adopted a two-stage training process with context windows of 4k and 16k. Training does not exceed 5 epochs over the dataset but pushes the number of training tokens well beyond the compute-optimal level suggested by prior research.
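A rough sketch of what such a two-stage schedule might look like as a configuration is below; only the 4k and 16k context lengths come from the summary above, and the remaining fields are placeholders rather than the paper's actual settings.

```python
# Illustrative two-stage schedule; context lengths are from the summary above,
# everything else is a placeholder, not the paper's hyperparameters.
training_stages = [
    {
        "name": "stage-1-pretraining",
        "context_window": 4_096,
        "token_budget": "bulk of the 3.3-4.3T training tokens",
    },
    {
        "name": "stage-2-long-context",
        "context_window": 16_384,
        "token_budget": "a smaller continued-pretraining budget on long samples",
    },
]

for stage in training_stages:
    print(f"{stage['name']}: ctx={stage['context_window']}, tokens={stage['token_budget']}")
```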
A Game-Changer for Code LLMs
StarCoder2 models have posted impressive results across a comprehensive set of Code LLM benchmarks, demonstrating a clear lead over most other Code LLMs:
StarCoder2-3B surpasses other models of similar size and even matches or outperforms StarCoderBase-15B.
StarCoder2-15B significantly outperforms models of comparable size and matches or beats CodeLlama-34B.
For low-resource languages and math or code reasoning benchmarks, StarCoder2-15B consistently outperforms DeepSeekCoder-33B.
Applying StarCoder2
As leading Code LLMs show promising performance improvements, we can expect their impact on real-world applications to grow. Code LLMs will potentially enhance all phases of the software development cycle, such as project implementation, quality assurance, bug detection and fixing, maintenance tasks, and migration to new software versions. With an open model like StarCoder2, the possibilities for adoption and adaptation by developers are endless.
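For developers who want to experiment, here is a minimal completion sketch using the Hugging Face `transformers` library. It assumes the `bigcode/starcoder2-3b` checkpoint id on the Hugging Face Hub and enough memory to load the 3B model; treat the checkpoint name and generation settings as an assumption rather than an official recipe.

```python
# Minimal code-completion sketch with transformers; assumes the
# "bigcode/starcoder2-3b" checkpoint id on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```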
An Important Step Forward
The development of StarCoder2 and The Stack v2 represents a significant milestone in the BigCode project’s journey to create more capable and transparent Large Language Models for Code. While impressive results have already been achieved, the field of Code LLMs is ever-evolving, and we can expect even more breakthroughs in the coming years. With research pushing the limits of innovation and a commitment to open science, these models will continue to transform the way developers write, edit, and maintain code in the future.
Beyond Language Models: Byte Models are Digital World Simulators
Authors: Shangda Wu, Xu Tan, Zili Wang, Rui Wang, Xiaobing Li, Maosong Sun
Source and references: https://arxiv.org/abs/2402.19155
Introduction and Motivation
In the world of deep learning, we're used to seeing models that handle human-interpretable information like text, audio, and images. Recent advancements in natural language processing have spurred interest in building more advanced language models (LMs) such as GPT, capable of understanding complex patterns in textual data. But what if we tried to model the digital world at a more fundamental level?
Meet bGPT, a new model that focuses on bytes, the building blocks of the digital world. In contrast to traditional deep learning models that work with text, audio, or images, bGPT directly processes binary data, offering a more intrinsic understanding of the digital realm and setting the stage for a paradigm shift in how deep learning models operate.
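As a toy illustration of what "directly processing binary data" means, the snippet below turns any file, regardless of modality, into a sequence of byte ids in the range 0-255. This is not bGPT's actual preprocessing (which, among other things, groups bytes into patches), just a way to see the input representation a byte model consumes.

```python
# Toy byte-level tokenization: every file becomes a sequence of integers in
# [0, 255] that a byte model could consume. Not the paper's preprocessing.
from pathlib import Path

def file_to_byte_tokens(path: str, max_bytes: int = 1024) -> list[int]:
    """Read a file as raw bytes and return them as integer token ids."""
    raw = Path(path).read_bytes()[:max_bytes]
    return list(raw)  # each byte is already an id in 0..255

tokens = file_to_byte_tokens(__file__)
print(tokens[:16])  # the first few bytes of this script itself
```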