Greetings,
Welcome to the 21st edition of the State of AI. This time, we embark on a journey through the convergence of code and AI with Code Llama, and the innovative idea of bringing models to life from natural language instructions via Prompt2Model. We also introduce SoTaNa, a game-changer in open-source software development assistance. Dive deep into the fusion of text and realistic human imagery with TeCH, and witness the marvel of Dual-Stream Diffusion Net, which bridges the gap between text and dynamic video content.
These selected topics not only reflect the evolving nature of AI but also highlight its intertwining with human-computer interaction, software, and multimedia. Prepare for an enlightening and forward-looking exploration. Enjoy!
Best regards,
Contents
Code Llama: Open Foundation Models for Code
Prompt2Model: Generating Deployable Models from Natural Language Instructions
SoTaNa: The Open-Source Software Development Assistant
TeCH: Text-guided Reconstruction of Lifelike Clothed Humans
Dual-Stream Diffusion Net for Text-to-Video Generation
Code Llama: Open Foundation Models for Code
Authors: Baptiste Rozière†, Jonas Gehring†, Fabian Gloeckle†,∗, Sten Sootla†, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, Gabriel Synnaeve†
Source & References: https://ai.meta.com/blog/code-llama-large-language-model-coding/
Introduction
Code Llama is a family of large language models for code generation and infilling based on Llama 2. It provides state-of-the-art performance among open models, support for larger input contexts, and zero-shot instruction-following capabilities for programming tasks. There are three primary variants, each released with 7B, 13B, and 34B parameters: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct). All models are trained on sequences of 16k tokens and show improvements on inputs of up to 100k tokens.
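For readers who want to try the released checkpoints, here is a minimal sketch of loading one of them with the Hugging Face transformers library. The repository name `codellama/CodeLlama-7b-hf` follows the naming used on the Hugging Face Hub, and the prompt is purely illustrative.

```python
# Minimal sketch: generate a completion with a Code Llama checkpoint via
# the Hugging Face transformers library (repo name assumed from the Hub).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"  # variants: -Python-hf, -Instruct-hf; sizes: 7b, 13b, 34b
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```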
Code Llama Pipeline
The authors use a cascade of training and fine-tuning steps to create the Code Llama models. They start by fine-tuning Llama 2 foundation models on a dataset predominantly containing publicly available code, as well as 8% natural language data related to code. This creates the foundation for the Code Llama models. Next, they fine-tune the models on Python-focused datasets for the Code Llama - Python variant. Finally, they perform instruction fine-tuning on a mix of proprietary instruction data and a new machine-generated self-instruct dataset to create the Code Llama - Instruct variant.
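As a rough mental model of that cascade, the sketch below lists the stages and the data each variant is trained on, paraphrasing the summary above; the structure and labels are illustrative, not taken from the paper's training code.

```python
# Sketch of the Code Llama training cascade described above; stage names are
# illustrative labels and the data descriptions paraphrase this summary.
PIPELINE = [
    ("Llama 2 foundation model", "starting weights"),
    ("Code Llama", "publicly available code plus ~8% code-related natural language"),
    ("Code Llama - Python", "additional fine-tuning on Python-focused data"),
    ("Code Llama - Instruct", "proprietary instruction data + machine-generated self-instruct set"),
]

for model_name, training_data in PIPELINE:
    print(f"{model_name:28s} <- {training_data}")
```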
Infilling and Long Context Fine-Tuning
The researchers train the 7B and 13B Code Llama models on a multitask objective that combines standard autoregressive prediction with infilling: predicting a masked-out middle span of code from the surrounding prefix and suffix. This enables use cases such as completing code in the middle of an existing file and generating docstrings for existing functions.
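Here is a minimal sketch of how such infilling training data can be prepared, assuming a prefix-suffix-middle layout; the sentinel strings are placeholders standing in for the special tokens of the released tokenizer, whose exact formatting is not reproduced here.

```python
import random

# Sketch of prefix-suffix-middle (PSM) infilling data preparation: a training
# document is split at two random points and rearranged so the model learns to
# generate the missing middle span from its surrounding context.
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"  # placeholder sentinels

def make_infilling_example(document: str, rng: random.Random) -> str:
    i, j = sorted(rng.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # PSM ordering: the model sees prefix and suffix first, then autoregressively
    # predicts the middle span, terminated by an end-of-infill token.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"

rng = random.Random(0)
print(make_infilling_example("def add(a, b):\n    return a + b\n", rng))
```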
For improved performance on longer inputs, the models undergo an additional fine-tuning stage that extends the training context length from the 4,096 tokens used in Llama 2 to 16,384 tokens by modifying the parameters of the RoPE positional embeddings; the resulting models remain effective on inputs of up to 100,000 tokens. This enables repository-level reasoning for code completion and synthesis instead of just function-level or file-level capabilities.
Instruction Fine-Tuning and Datasets
Code Llama - Instruct models are further fine-tuned with proprietary instruction data and a novel self-instruct dataset generated by prompting Llama 2 for coding problems and using Code Llama to generate unit tests and solutions. In addition to improving instruction-following capabilities, this also enhances safety and helps to prevent unsafe, toxic, or biased generations.
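The self-instruct construction can be pictured as a generate-and-filter loop. The sketch below uses hypothetical helper callables for each step, since the paper's own prompts and execution harness are not reproduced here.

```python
# Hypothetical sketch of the self-instruct loop described above: one model
# proposes coding questions, another generates unit tests and candidate
# solutions, and only solutions that pass their tests are kept. All helper
# callables are placeholders, not an API from the paper's codebase.
from typing import Callable, List, Tuple

def build_self_instruct_set(
    propose_question: Callable[[], str],           # e.g. prompt Llama 2 for a coding problem
    generate_tests: Callable[[str], str],          # e.g. prompt Code Llama for unit tests
    generate_solutions: Callable[[str], List[str]],
    run_tests: Callable[[str, str], bool],         # sandboxed execution, returns pass/fail
    num_questions: int,
) -> List[Tuple[str, str]]:
    dataset = []
    for _ in range(num_questions):
        question = propose_question()
        tests = generate_tests(question)
        for solution in generate_solutions(question):
            if run_tests(solution, tests):         # keep the first passing solution
                dataset.append((question, solution))
                break
    return dataset
```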
Benchmark Performance
Code Llama models establish a new state of the art in various code generation benchmarks among open-source large language models. They demonstrate impressive performance on HumanEval, MBPP, and APPS, as well as MultiPL-E, a multilingual version of HumanEval.
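These benchmarks report pass@k. For context, the sketch below implements the standard unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021); it is background tooling rather than code from Code Llama itself.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples with c correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 37 of which pass the unit tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))    # = 37/200 = 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))
```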
Training Details and Optimization
Code Llama models are trained with the AdamW optimizer (β1 = 0.9, β2 = 0.95), a cosine learning-rate schedule, and a batch size of 4M tokens. For the Code Llama - Instruct models, the learning rate is set according to model size, with larger models trained at lower learning rates.
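Here is a minimal PyTorch sketch of that optimizer setup. The peak learning rate, warmup length, total step count, and weight decay below are illustrative placeholders rather than values taken from the paper.

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

# Sketch of the optimizer setup named above: AdamW with beta1=0.9, beta2=0.95
# and a cosine learning-rate schedule. Hyperparameters other than the betas are
# placeholders for illustration.
model = torch.nn.Linear(4096, 4096)  # stand-in for the transformer
peak_lr, warmup_steps, total_steps = 3e-4, 1000, 10_000

optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def cosine_with_warmup(step: int) -> float:
    # Linear warmup to the peak learning rate, then a cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=cosine_with_warmup)
```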
During the long context fine-tuning stage, the authors change the base period from which rotation frequencies of rotary position embeddings are derived, allowing for processing larger sequences and reducing bias towards short-distance attention. The fine-tuning process is optimized for the various sizes of Code Llama models, each with different learning rates and gradient steps, while maintaining stability and efficiency.
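To make the base-period change concrete, the toy sketch below computes RoPE rotation frequencies for two base values; the increase from 10,000 to 1,000,000 is the change reported for Code Llama's long-context fine-tuning, while the head dimension here is just an example.

```python
import numpy as np

# RoPE rotation frequencies as a function of the base period. Increasing the
# base lowers the frequencies, so positional phase changes more slowly and
# attention is less biased toward short distances.
def rope_frequencies(head_dim: int, base: float) -> np.ndarray:
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

for base in (10_000.0, 1_000_000.0):
    freqs = rope_frequencies(head_dim=128, base=base)
    print(f"base={base:>9.0f}  slowest rotation period ≈ {2 * np.pi / freqs[-1]:,.0f} tokens")
```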
Conclusion
Code Llama offers an exciting new suite of large language models specifically designed for code generation and infilling tasks in programming contexts. With specialized Python models, long context fine-tuning, and instruction-enhanced capabilities, the Code Llama family is poised to make significant contributions to the field of AI-powered coding assistance. By achieving state-of-the-art performance on multiple benchmarks, it showcases the vast potential held by fine-tuned language models when applied to domain-specific tasks such as code generation and instruction-following.
Prompt2Model: Generating Deployable Models from Natural Language Instructions
Authors: Vijay Viswanathan, Chenyang Zhao, Amanda Bertsch, Tongshuang Wu, Graham Neubig
Source & References: https://arxiv.org/abs/2308.12261
Introduction
Have you ever wanted to build a working AI system just by describing your task in natural language? Say hello to "Prompt2Model", a method that combines the benefits of large language models (LLMs) with the practicality of specialized, deployable NLP models. Prompt2Model lets developers describe the desired AI interface in natural language and then generates a specialized, deployment-ready model. It achieves this by automating the entire machine learning pipeline, from data retrieval and generation to model training and evaluation.
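The pipeline stages can be pictured with the hypothetical end-to-end sketch below (dataset retrieval, LLM-based data generation, model retrieval, fine-tuning, evaluation). Every function is a stub of our own naming, not the project's actual API.

```python
# Hypothetical sketch of a Prompt2Model-style pipeline; each helper is a stub
# standing in for a stage the paper automates.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (input text, output text)

@dataclass
class TaskSpec:
    instruction: str                                  # natural-language task description
    demonstrations: List[Example] = field(default_factory=list)

def retrieve_datasets(spec: TaskSpec) -> List[Example]:
    return []                                         # stub: search dataset hubs for matching data

def generate_dataset(spec: TaskSpec) -> List[Example]:
    return list(spec.demonstrations)                  # stub: expand examples with an LLM

def retrieve_pretrained_model(spec: TaskSpec) -> str:
    return "small-seq2seq-backbone"                   # stub: choose a deployable base model

def finetune(model_name: str, data: List[Example]) -> str:
    return f"{model_name}-finetuned-on-{len(data)}-examples"   # stub trainer

def evaluate(model_name: str, data: List[Example]) -> Dict[str, object]:
    return {"model": model_name, "eval_examples": len(data)}   # stub evaluator

def prompt_to_model(spec: TaskSpec):
    data = retrieve_datasets(spec) + generate_dataset(spec)
    model = finetune(retrieve_pretrained_model(spec), data)
    return model, evaluate(model, data)

spec = TaskSpec(
    "Classify whether a product review is positive or negative.",
    demonstrations=[("Great battery life!", "positive")],
)
print(prompt_to_model(spec))
```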