Welcome to the seventh edition of the State of AI newsletter, your definitive guide to the latest AI developments. This edition places special emphasis on large language models (LLMs): their capabilities, constraints, and future prospects.
We begin with an examination of proprietary LLMs, focusing on distinguishing hype from reality, and then turn to the interface between LLMs and APIs. As LLMs evolve, their interaction with APIs has become a focal point of research and forms an essential part of our discussion.
Our journey takes us further into the domain of video sequences and their interpretation by LLMs, a field growing in importance given the rising prevalence of video content. Subsequently, we address an intriguing possibility: Can LLMs critique and correct themselves? This fascinating aspect of self-corrective models promises to add a new dimension to our understanding of AI's capabilities.
Lastly, we delve into the role of LLMs in code understanding and generation, a critical area as coding becomes increasingly integrated into our digital lives. This edition, therefore, serves as a comprehensive overview of the ongoing revolution in the world of AI, spotlighting the dynamic and evolving landscape of LLMs. We invite you to join us in these thought-provoking discussions.
Best regards,
Contents
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
Gorilla: Large Language Model Connected with Massive APIs
VideoLLM: Modeling Video Sequence with Large Language Models
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
The False Promise of Imitating Proprietary LLMs
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
Authors: Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi D.Q. Bui, Junnan Li, Steven C.H. Hoi
Source & References: https://arxiv.org/abs/2305.07922
Introduction
Recent years have seen the rise of large language models (LLMs) in the code domain, demonstrating remarkable success on a wide array of code understanding and generation tasks. However, existing code LLMs, whether encoder-only or decoder-only, often exhibit limited flexibility and applicability across tasks. To address these limitations, the authors of CodeT5+ introduce a family of encoder-decoder LLMs designed to accommodate a wide range of downstream code tasks.
Motivation behind CodeT5+
The authors identify two key limitations in existing code LLMs: the model architecture and the pretraining tasks. Encoder-only models perform well on understanding tasks and decoder-only models excel at generative tasks, but each transfers poorly to tasks outside its specialty. Unified encoder-decoder models have been introduced to handle both types of tasks, yet their performance on certain tasks remains suboptimal.
Moreover, most code LLMs use a limited set of pretraining objectives, which might not suit some downstream tasks, leading to subpar performance. CodeT5+ aims to tackle these issues by introducing a flexible combination of component modules, a mixture of pretraining objectives, and efficient initialization.
CodeT5+ Architecture and Pretraining Strategy
The CodeT5+ model consists of an encoder and decoder based on the Transformer architecture. During the pretraining process, the model goes through two stages: unimodal pretraining on code data and bimodal pretraining on text-code data.
Unimodal Pretraining on Code Data:
In the unimodal pretraining stage, the model learns from large-scale code data using a mixture of tasks: span denoising and causal language modeling (CLM). These tasks enable the model to learn contextual representations from code data and adapt to different downstream code tasks.
Span denoising: As in T5, random spans of tokens are replaced with sentinel tokens and the model is trained to recover the original spans (see the sketch after this list).
Causal language modeling (CLM): The authors introduce two variants of CLM: a decoder-only generation task and a sequence-to-sequence causal LM objective.
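To make these two objectives concrete, here is a minimal sketch of T5-style span corruption applied to a token sequence. The corruption rate, mean span length, and sentinel IDs are illustrative assumptions rather than the exact values used for CodeT5+.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, sentinel_start=32000):
    """T5-style span denoising: replace random spans with sentinel tokens.

    Returns (corrupted_input, target), where the target lists each sentinel
    followed by the tokens it replaced. All hyperparameters are illustrative.
    """
    corrupted, target = [], []
    i, sentinel = 0, sentinel_start
    while i < len(tokens):
        # Start a new masked span with probability ~corruption_rate / mean_span_len
        if random.random() < corruption_rate / mean_span_len:
            span_len = max(1, min(mean_span_len, len(tokens) - i))
            corrupted.append(sentinel)                  # sentinel marks the masked span
            target.extend([sentinel] + tokens[i:i + span_len])
            sentinel += 1
            i += span_len
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, target

# Example: corrupt a (stand-in) tokenized code snippet
enc_input, dec_target = span_corrupt(list(range(20)))
print(enc_input)
print(dec_target)
```

The CLM variants, by contrast, train the model to generate code continuations autoregressively, either with the decoder alone or conditioned on a prefix through the encoder.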
Bimodal Pretraining on Text-code Data:
In the bimodal pretraining stage, the model continues training on a combination of code and text data using cross-modal learning objectives like contrastive learning, matching, and CLM tasks.
Contrastive learning: This task helps the model learn better unimodal representations and fine-grained text-code alignments, which in turn allows for improved retrieval performance.
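As a rough illustration of this objective, the sketch below computes a symmetric InfoNCE-style contrastive loss over pooled text and code embeddings. The pooling, temperature, and in-batch-negatives setup are assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn.functional as F

def text_code_contrastive_loss(text_emb, code_emb, temperature=0.05):
    """Symmetric InfoNCE loss over a batch of paired text/code embeddings.

    text_emb, code_emb: [batch, dim] pooled encoder outputs for matched pairs.
    Matched pairs sit on the diagonal; all other in-batch pairs act as negatives.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = text_emb @ code_emb.t() / temperature    # [batch, batch] similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_t2c = F.cross_entropy(logits, labels)        # text -> code retrieval direction
    loss_c2t = F.cross_entropy(logits.t(), labels)    # code -> text retrieval direction
    return (loss_t2c + loss_c2t) / 2
```

Pulling matched text-code pairs together while pushing apart mismatched in-batch pairs is what makes the resulting embeddings useful for retrieval.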
Efficient Initialization and Instruction Tuning
To scale up the model size of CodeT5+ efficiently, the authors initialize its components with off-the-shelf code LLMs. They employ a "shallow encoder and deep decoder" architecture and train only a small subset of parameters while freezing the rest of the model.
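A minimal sketch of this kind of parameter freezing is shown below; the keyword-based selection of trainable modules (a small encoder plus cross-attention layers) is a hypothetical stand-in for the paper's exact recipe.

```python
import torch.nn as nn

def freeze_except(model: nn.Module, trainable_keywords=("encoder", "cross_attn")):
    """Freeze all parameters except those whose names match trainable_keywords.

    The keyword list is illustrative: the idea is to keep the large pretrained
    decoder frozen and train only the shallow encoder and connecting layers.
    """
    n_trainable, n_frozen = 0, 0
    for name, param in model.named_parameters():
        if any(key in name for key in trainable_keywords):
            param.requires_grad = True
            n_trainable += param.numel()
        else:
            param.requires_grad = False
            n_frozen += param.numel()
    print(f"trainable params: {n_trainable:,} | frozen params: {n_frozen:,}")
    return model
```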
Inspired by recent advances in instruction tuning, the authors also explore aligning CodeT5+ with natural language instructions using continuous and discrete prompts, allowing for better adaptation to downstream tasks.
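As a hedged illustration, instruction tuning typically formats synthetic (instruction, input, output) triples into prompt/target pairs for supervised fine-tuning; the Alpaca-style template below is a common choice and not necessarily the exact one used for CodeT5+.

```python
def format_instruction_example(instruction, model_input, output):
    """Turn one instruction-tuning example into (prompt, target) strings.

    The template is illustrative, not the exact prompt format used by the authors.
    """
    prompt = (
        "Below is an instruction that describes a task.\n\n"
        f"### Instruction:\n{instruction}\n\n"
    )
    if model_input:
        prompt += f"### Input:\n{model_input}\n\n"
    prompt += "### Response:\n"
    return prompt, output

prompt, target = format_instruction_example(
    "Write a Python function that reverses a string.",
    "",
    "def reverse(s):\n    return s[::-1]",
)
```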
Model Evaluation and Results
CodeT5+ was extensively evaluated on over 20 code-related benchmarks in various settings, such as zero-shot, finetuning, and instruction tuning. In many cases, the CodeT5+ framework demonstrated substantial performance gains over state-of-the-art baselines, such as improvements in text-to-code retrieval tasks, line-level code completion, retrieval-augmented code generation, and math programming tasks.
Interestingly, in the zero-shot text-to-code generation task on the HumanEval benchmark, the instruction-tuned CodeT5+ 16B model set new state-of-the-art results, even surpassing the performance of the closed-source OpenAI code-cushman-001 model.
Conclusion
In summary, CodeT5+ addresses significant limitations in existing code LLMs by introducing a family of flexible encoder-decoder models and a diverse mixture of pretraining tasks. By leveraging an efficient initialization strategy and instruction tuning, CodeT5+ demonstrates impressive performance gains over state-of-the-art baselines across various code-related benchmarks.
The authors provide insights on scalable model initialization, pretrain-finetune discrepancy mitigation, and achieving state-of-the-art results on a wide range of tasks. By open-sourcing the CodeT5+ models, the research community can further develop and refine these code LLMs, paving the way for even more impressive achievements in the code generation and understanding domain.
The False Promise of Imitating Proprietary LLMs
Authors: Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song
Source & References: https://arxiv.org/abs/2305.15717
Introduction
Language models (LMs) have become an essential part of modern AI systems, with powerful proprietary models like ChatGPT, Bard, and Claude dominating the market. In contrast, open-source LMs like LLaMA and FLAN-T5 continue to play catch-up, trailing their proprietary counterparts in performance. This raises an important question about the future: will the best LMs be closed-source or freely available for public use?
A paper by researchers at UC Berkeley investigates this question, focusing on the technique of model imitation. This method involves collecting data via proprietary LM APIs and fine-tuning open-source LMs to emulate the performance of their closed-source counterparts.
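In outline, imitation data collection looks like the sketch below: query the proprietary model on a set of prompts and store its responses as supervised fine-tuning targets for an open-source model. The helper query_proprietary_lm and the JSONL format are hypothetical placeholders, since the exact API client and training stack vary.

```python
import json

def query_proprietary_lm(prompt: str) -> str:
    """Hypothetical placeholder for a call to a proprietary LM API
    (e.g. an OpenAI-style chat endpoint); the real client call depends
    on the provider and SDK version."""
    raise NotImplementedError

def build_imitation_dataset(prompts, out_path="imitation_data.jsonl"):
    """Collect (prompt, teacher response) pairs for later fine-tuning."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            response = query_proprietary_lm(prompt)
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")
```

The resulting pairs can then be used for standard supervised fine-tuning of a model such as LLaMA, which is exactly the setup whose limits the paper examines.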