Greetings,
Welcome to the 49th edition of the State of AI. In this issue, we explore the frontiers of multimodal understanding with Gemini 1.5, delve into the world of open foundation models with Yi by 01.AI, and ponder how far we are from automating front-end engineering with Design2Code. We also discuss GaLore, a memory-efficient LLM training technique, and introduce VisionLLaMA, a unified interface for vision tasks built on LLaMA.
Each of these topics showcases the relentless progress and innovative applications of AI, promising a captivating and informative read. We hope you find this edition both engaging and thought-provoking.
Best regards,
Contents
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Yi: Open Foundation Models by 01.AI
Design2Code: How Far Are We From Automating Front-End Engineering?
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context
Authors: Gemini Team, Google
Source and references: https://arxiv.org/abs/2403.05530
Introduction
The latest breakthrough in the world of AI is Gemini 1.5 Pro, a multimodal model developed by the Gemini Team at Google. This state-of-the-art model is designed to handle extremely long contexts, extending the frontier of language model context lengths to millions of tokens. This means Gemini 1.5 Pro can process entire collections of documents, multiple hours of video, and several days' worth of audio. Its performance rivals that of Gemini 1.0 Ultra while requiring significantly less training compute, and it unlocks new capabilities such as in-context learning from entire long documents.
A New Kind of AI Model
Gemini 1.5 Pro is built on a sparse mixture-of-experts (MoE) Transformer architecture, making it particularly suited to handling long contexts. The model benefits from recent research advances in routing functions that direct each input to a subset of the model's parameters for processing. This allows the total parameter count to grow while keeping the number of parameters activated for any given input constant.
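To make the routing idea concrete, here is a minimal top-k MoE layer sketched in PyTorch. This illustrates the general sparse-MoE technique, not Gemini's implementation, which Google has not released; the expert count, layer sizes, and top-k value below are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse mixture-of-experts feed-forward layer: a learned
    router sends each token to a few experts, so total parameters grow with
    the number of experts while per-token compute stays roughly constant."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # the routing function
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (n_tokens, d_model)
        weights, chosen = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)  # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e  # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

For example, `SparseMoELayer()(torch.randn(16, 512))` returns a `(16, 512)` tensor while activating only 2 of the 8 expert networks per token.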
The authors have made improvements across nearly every aspect of the model stack, including architecture, data, optimization, and systems. With these enhancements, Gemini 1.5 Pro achieves comparable quality to its predecessor, Gemini 1.0 Ultra, with significantly lower training compute and a more efficient serving capacity.
Evaluating Gemini 1.5 Pro
The evaluation of Gemini 1.5 Pro covers three main categories: qualitative long-context multimodal evaluations, quantitative long-context multimodal evaluations, and quantitative core evaluations. This lets researchers measure Gemini 1.5 Pro's abilities on both synthetic and real-world tasks across all three modalities: text, vision, and audio.
Exploring Novel Capabilities
Some of the unique interactions observed with Gemini 1.5 Pro include:
Ingesting large codebases, such as the JAX library, and answering specific queries about them.
Learning to translate a new language, like Kalamang, from a single set of linguistic documentation.
Identifying and locating famous scenes from literature (e.g., Les Misérables) based on hand-drawn sketches.
Answering questions about entire movies sampled at one frame per second.
Probing Long-Context Abilities
Researchers conducted various diagnostic-focused probing studies to understand Gemini 1.5 Pro's long-context capabilities, such as measuring perplexity over long sequences and conducting needle-in-a-haystack retrieval tests. Gemini 1.5 Pro's recall capabilities surpass those of other state-of-the-art models like Claude 2.1 and GPT-4 Turbo, achieving near-perfect recall (>99%) up to at least 10 million tokens.
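Needle-in-a-haystack probes are easy to reproduce in spirit: plant a unique fact (the "needle") at a controlled depth inside long distractor text (the "haystack"), then check whether the model retrieves it. Here is a minimal harness sketch; `query_model` is a hypothetical callable standing in for whichever model API is under test, and the needle itself is invented for illustration.

```python
NEEDLE = "The magic number for the vault is 482917."
QUESTION = "What is the magic number for the vault?"

def build_haystack(filler_sentences, needle, depth_fraction):
    """Insert the needle at a given relative depth inside the distractor text."""
    pos = int(len(filler_sentences) * depth_fraction)
    return " ".join(filler_sentences[:pos] + [needle] + filler_sentences[pos:])

def run_retrieval_probe(query_model, filler_sentences, depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return the fraction of insertion depths at which the needle is recalled.

    query_model(prompt) is a hypothetical wrapper around the model under
    test that returns its text answer."""
    hits = 0
    for depth in depths:
        context = build_haystack(filler_sentences, NEEDLE, depth)
        answer = query_model(f"{context}\n\n{QUESTION}")
        hits += "482917" in answer
    return hits / len(depths)
```

Sweeping both the insertion depth and the total context length is what produces the recall numbers reported for these tests.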
These diagnostic evaluations provide valuable insight into how effectively the model uses very long contexts to improve next-token prediction. The authors observed a power-law trend in the cumulative average negative log-likelihood (NLL) out to 1 million tokens for long documents and 2 million tokens for code.
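The perplexity probe can be made concrete too: compute per-token negative log-likelihoods over one long document and track their cumulative average as a function of position. A minimal sketch in PyTorch, assuming you already have the model's per-token log-probabilities:

```python
import torch

def cumulative_average_nll(token_logprobs):
    """Cumulative mean of per-token negative log-likelihoods.

    token_logprobs: 1-D tensor holding log p(token_i | tokens_<i) for a
    single long document. A curve that keeps decreasing with position
    means the model is still extracting predictive signal from
    ever-longer context."""
    nll = -token_logprobs
    positions = torch.arange(1, nll.numel() + 1, dtype=nll.dtype)
    return nll.cumsum(dim=0) / positions
```

Fitting a power law to this curve is what yields the trend the authors report.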
Realistic Evaluations of Long-Context Capabilities
In addition to synthetic tasks, the authors designed realistic evaluations that required Gemini 1.5 Pro to retrieve and reason over multiple parts of long contexts. These evaluations included long-document question-answering, long-context automatic speech recognition, learning to translate a new language from a single book, and long-context video question-answering. In each case, Gemini 1.5 Pro outperformed all competing models, even those augmented with external retrieval methods.
Gemini 1.5 Pro's Core Capabilities
Although Gemini 1.5 Pro excels at long-context tasks, this doesn't come at the expense of its core multimodal capabilities. In fact, it surpasses Gemini 1.0 Pro on the vast majority of core benchmarks, spanning math, science, reasoning, multilinguality, video understanding, image understanding, and code. Most notably, Gemini 1.5 Pro performs better than Gemini 1.0 Ultra on more than half of the core benchmarks, despite using less training compute and being more efficient to serve.
Conclusion
Gemini 1.5 Pro represents a significant breakthrough in AI research, with unprecedented capabilities for handling long contexts across multiple modalities. The authors' comprehensive evaluation showcases its strong performance on both synthetic and real-world tasks, as well as its ability to learn new skills and knowledge on the fly. As long-context capabilities continue to mature, further research is needed to chart this rapidly evolving frontier and uncover even more compelling applications of AI.
Yi: Open Foundation Models by 01.AI
Authors: 01.AI Team
Source and references: https://arxiv.org/abs/2403.04652v1
Introducing the Yi Model Family
The world of artificial intelligence has seen remarkable breakthroughs thanks to large-scale language models. The team at 01.AI has introduced the Yi model family, a series of language and multimodal models with strong multi-dimensional capabilities. The aim is to make these models the next-generation computational platform and provide the community with amplified intelligence.
The Yi model family is built on 6B and 34B pretrained language models, which are then extended to chat models, 200K long-context models, depth-upscaled models, and vision-language models. The base models achieve strong performance on benchmarks like MMLU, and the finetuned chat models deliver impressive human preference rates on evaluation platforms such as AlpacaEval and Chatbot Arena.
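One of those extensions, depth upscaling, deepens an already-trained model by duplicating some of its transformer layers and then continuing pretraining. Below is a minimal sketch of the duplication step in PyTorch; which block of layers to copy is our assumption for illustration, not 01.AI's exact recipe.

```python
import copy
import torch.nn as nn

def depth_upscale(layers, start, end):
    """Duplicate a contiguous block of pretrained decoder layers.

    layers: nn.ModuleList of transformer blocks. Copies of
    layers[start:end] are inserted right after the originals, producing
    a deeper model; continued pretraining then lets the duplicated
    layers differentiate. The choice of block is illustrative."""
    duplicated = [copy.deepcopy(layer) for layer in layers[start:end]]
    return nn.ModuleList(list(layers[:end]) + duplicated + list(layers[end:]))
```

In practice, the upscaled model is trained further so the duplicated layers can specialize rather than remain exact copies.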