Greetings,
Welcome to the 24th edition of the State of AI. As we navigate the forward march of technology, this issue beckons readers to explore the continued evolution of learning through “Textbooks Are All You Need II: phi-1.5 technical report,” the ambitious strides in multimodal LLMs with “NExT-GPT,” and the synergy between evolutionary algorithms and large language models in crafting robust prompt optimizers. We also venture into the mesmerizing realm of “Generative Image Dynamics” and introduce the groundbreaking “Agents” framework, shaping the future of autonomous language agents.
Each topic underscores the ever-expanding horizons of AI, offering profound insights and novel discoveries. Immerse yourself in these innovations, and let your curiosity guide you. Enjoy!
Best regards,
Contents
Textbooks Are All You Need II: phi-1.5 technical report
NExT-GPT: Any-to-Any Multimodal LLM
Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers
Generative Image Dynamics
Agents: An Open-source Framework for Autonomous Language Agents
Textbooks Are All You Need II: phi-1.5 Technical Report
Authors: Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, Yin Tat Lee
Source & References: https://arxiv.org/abs/2309.05463v1
Introduction
The team at Microsoft Research has continued their exploration of the power of smaller, transformer-based language models. The aim is to achieve high-level capabilities with fewer parameters, thus helping democratize advanced AI technology, lower energy consumption, and make experimentation more manageable.
In this paper, they introduce phi-1.5, a 1.3 billion parameter model that performs on natural language tasks comparably to models five times its size, and even surpasses them on more complex reasoning tasks like grade-school mathematics and basic coding. Unlike larger models, phi-1.5 is trained almost exclusively on high-quality, "textbook-like" synthetic data, which seems to mitigate issues concerning toxic and biased content generation.
Technical Specifications
The architecture of phi-1.5 is identical to that of phi-1: 24 layers and 32 attention heads, each head having a dimension of 64. The researchers used a dataset of 30 billion tokens (a combination of phi-1's training data and new synthetic data) aimed at teaching common sense reasoning and general knowledge of the world.
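As a sanity check on the figures above, a back-of-the-envelope calculation recovers the stated 1.3B parameter count from 24 layers, 32 heads, and head dimension 64. The vocabulary size and the 12·d² per-layer approximation below are our assumptions for illustration, not numbers from the paper:

```python
# Rough parameter count for phi-1.5 from the configuration stated above.
# The 12*d^2 per-layer estimate (4*d^2 attention + 8*d^2 for a 4x-wide MLP)
# and the vocabulary size are assumptions, not taken from the paper.
n_layers = 24
n_heads = 32
head_dim = 64
d_model = n_heads * head_dim          # 32 * 64 = 2048 hidden dimension

per_layer = 12 * d_model ** 2         # approximate params per decoder block
vocab_size = 51_200                   # assumed; typical of code-oriented tokenizers
embeddings = vocab_size * d_model

total = n_layers * per_layer + embeddings
print(f"d_model = {d_model}")
print(f"approx. parameters = {total / 1e9:.2f}B")  # lands near the reported 1.3B
```

The estimate comes out around 1.31B, consistent with the paper's reported 1.3 billion parameters.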
To understand the importance of traditional web data, they created two additional models: phi-1.5-web-only, trained purely on filtered web data, and phi-1.5-web, trained on a mix of filtered web data and their synthetic datasets. Comparing these variants against the base model underscores the value of high-quality synthetic data in training.
Benchmark Results
The performance of phi-1.5 was evaluated across several language tasks, including common sense reasoning, language understanding, mathematics, and coding. It achieved results comparable to Llama2-7B, Falcon-7B, and Vicuna-13B on common sense reasoning and language understanding tasks, and even outperformed them on multi-step reasoning tasks such as coding.
Against other similarly sized models, the phi-1.5 variants trained with web data showed improved performance, making a strong case for the team's data-centric approach.
Addressing Toxicity and Biases
Toxic and biased content generation is a challenge for language models, and Microsoft Research aimed to minimize these issues by training phi-1.5 with textbook-like synthetic data.
Models trained on traditional web data can exhibit harmful responses to certain prompts, leading to biased completions. In similar cases, phi-1.5 responded more safely and ethically, providing a more reliable alternative.
Conclusion
By focusing on common sense reasoning in natural language and using synthetic data, the creators of phi-1.5 have built a powerful model that surpasses most non-frontier language models in more complex reasoning tasks. Furthermore, the use of synthetic data has demonstrated potential benefits in mitigating the generation of toxic and biased content.
This research not only emphasizes the value of high-quality synthetic data in training but also highlights the importance of reducing the parameter count of AI models. Smaller models like phi-1.5 are more manageable, energy-efficient, and accessible, offering a promising direction for future AI research and development.
Impact
The introduction of phi-1.5 is an important step forward for AI, as it addresses the challenges of democratizing advanced technology while mitigating important issues such as toxicity and biases. With continued research and the open-source release of phi-1.5, the AI community can work towards more sustainable, efficient, and ethical AI solutions for natural language tasks and beyond.
By making phi-1.5 open-source, Microsoft Research is promoting further research on urgent topics like in-context learning, interpretability, and mitigation strategies for hallucinations, toxic content generation, and biased outputs.
Ultimately, this paper reinforces the idea that quality training data, rather than merely increased scale and resources, is vital for creating powerful and ethical AI models.
NExT-GPT: Any-to-Any Multimodal LLM
Authors: Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua
Source & References: https://arxiv.org/abs/2309.05519
Introduction
The recent progress of Artificial Intelligence Generated Content (AIGC) has been impressive, with the development of technologies such as ChatGPT for text generation and diffusion models for visual generation. Large Language Models (LLMs) like Flan-T5, Vicuna, LLaMA, and Alpaca have demonstrated strong language reasoning and decision-making capabilities. Our world is inherently multimodal, and as a result, much research has gone into developing multimodal LLMs (MM-LLMs) that can understand different modalities such as text, images, videos, and audio. However, most of this work falls short of achieving full any-to-any modality conversion, making the exploration of any-to-any MM-LLMs essential for human-level artificial intelligence.
To address this gap, the authors present an end-to-end any-to-any MM-LLM system called NExT-GPT. NExT-GPT connects LLMs with multimodal adaptors and different diffusion decoders, enabling the system to perceive and generate content in different combinations of text, images, videos, and audio. By leveraging the well-trained high-performing encoders and decoders, NExT-GPT allows for low-cost training and expansion of more potential modalities.
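The plug-in structure described above can be sketched as a minimal, runnable pipeline: frozen per-modality encoders feed a central LLM, whose output is routed to frozen per-modality decoders. All class and function names here are illustrative placeholders, not NExT-GPT's actual code, and the encoder/LLM/decoder stages are stubbed with toy functions:

```python
# Schematic sketch of an any-to-any pipeline in the spirit of NExT-GPT:
# modality encoders -> central LLM -> modality decoders. Everything here
# is a stand-in; the real system uses pretrained encoders, an LLM, and
# diffusion decoders joined by lightweight projection layers.
from typing import Callable, Dict, List, Tuple

Encoder = Callable[[str], List[str]]  # raw input -> feature "vector"
Decoder = Callable[[List[str]], str]  # LLM signal -> generated content

def make_stub(tag: str) -> Tuple[Encoder, Decoder]:
    """Toy stand-ins for frozen pretrained encoders/decoders."""
    enc = lambda raw: [f"{tag}-feat:{raw}"]
    dec = lambda sig: f"<{tag} generated from {sig[0]}>"
    return enc, dec

class AnyToAnyPipeline:
    def __init__(self) -> None:
        self.encoders: Dict[str, Encoder] = {}
        self.decoders: Dict[str, Decoder] = {}

    def register(self, modality: str) -> None:
        # Adding a modality means plugging in one encoder/decoder pair
        # (plus, in the real system, a small projection layer), which is
        # what keeps training and expansion low-cost.
        enc, dec = make_stub(modality)
        self.encoders[modality] = enc
        self.decoders[modality] = dec

    def run(self, src: str, raw: str, dst: str) -> str:
        features = self.encoders[src](raw)   # encode source modality
        signal = [f"llm({features[0]})"]     # central LLM reasoning (stubbed)
        return self.decoders[dst](signal)    # decode into target modality

pipe = AnyToAnyPipeline()
for m in ("text", "image", "video", "audio"):
    pipe.register(m)

print(pipe.run("image", "a cat photo", "audio"))
# prints "<audio generated from llm(image-feat:a cat photo)>"
```

The point of the sketch is the registry design: any source/target pair works once a modality is registered, which mirrors how NExT-GPT achieves any-to-any conversion without retraining the heavy components.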