GPT-4V(ision), Language Modeling Is Compression, LongLoRA, DreamLLM, and more 🚀
Week 4, September 2023
Greetings,
Welcome to the landmark 25th edition of the State of AI. This edition offers a panoramic view of the AI landscape: OpenAI's cutting-edge multimodal language model, GPT-4V(ision); LongLoRA, which makes fine-tuning long-context models far more efficient; a study arguing that language modeling is, at its core, compression; DreamLLM, a framework for synergistic multimodal comprehension and creation; and CulturaX, a cleaned, comprehensive dataset spanning 167 languages for training large language models.
Together, these pieces illustrate the broadening horizons and sheer potential of AI. We invite you to immerse yourself in these breakthroughs and ponder the possibilities. Enjoy!
Best regards,
Contents
GPT-4V(ision): A Look into OpenAI's Multimodal Language Model
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
Language Modeling Is Compression
DreamLLM: Synergistic Multimodal Comprehension and Creation
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
GPT-4V(ision): A Look into OpenAI's Multimodal Language Model
Authors: OpenAI
Source & References: https://cdn.openai.com/papers/GPTV_System_Card.pdf
Introduction
The ever-growing landscape of AI research has led OpenAI to develop a multimodal language model, GPT-4 with Vision (GPT-4V). By accepting image inputs alongside text, the model opens up new possibilities for how language models can be used, unlocking applications across a wide range of domains.
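To make the multimodal interface concrete, here is a minimal sketch of what sending an image to a GPT-4V-style model might look like via the OpenAI Python SDK. The model name and message schema shown are illustrative assumptions rather than details from the system card, since broad API access was still rolling out at the time of publication.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative call: a chat message mixing text and an image URL.
# The model name is an assumption; use whichever vision-capable
# model your account actually has access to.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image for a blind user."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/street-scene.jpg"}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```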
GPT-4V completed training in 2022, with reinforcement learning from human feedback (RLHF) used to fine-tune its behavior. Early access began in March 2023 under OpenAI's iterative deployment approach, and a robust program of evaluations and external expert insights helped prepare the model for widespread use, addressing the myriad safety concerns that come with combining text and vision modalities.
Deployment Preparation: Early Access Insights
OpenAI opened the doors for alpha users to provide valuable feedback on GPT-4V's capabilities. In collaboration with Be My Eyes, a platform assisting visually impaired users, OpenAI developed a tool called Be My AI, integrating GPT-4V into the existing platform to provide image descriptions for blind and low-vision users. The app was beta-tested from March to August 2023, growing to a base of 16,000 testers who requested 25,000 descriptions per day.
Tester feedback showed that Be My AI significantly reduced hallucinations and errors, though users were still cautioned against relying on the AI for health- and safety-related decisions. The collaboration also sparked passionate conversations about facial recognition and privacy, prompting researchers to develop responsible policies for describing facial characteristics without disclosing personal identity.
During the developer alpha phase, OpenAI analyzed a sample of traffic from thousands of alpha testers to better understand user queries and assess potential risks, including medical diagnosis requests, bias, privacy concerns, and sentiment analysis. This analysis informed refinements in how risky user queries are evaluated and mitigated.
Evaluation: A Look at Harmful Content, Privacy, and More
GPT-4V underwent thorough qualitative and quantitative evaluations, including refusal and performance accuracy metrics, to ensure a safe and secure system. Evaluations spanned several risk domains, such as harmful content, representation and allocation, privacy, cybersecurity, and multimodal jailbreaks.
To maintain consistent behavior across demographics, OpenAI monitored the model's performance on sensitive trait attribution and tightened refusals for such requests. Person identification accuracy and refusal evaluations showed a significant improvement in refusing potentially risky queries and a marked reduction in the model's ability to identify people in photos from public datasets.
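As a rough illustration of how such refusal evaluations can be scored, here is a hypothetical sketch. The helper names and the response classifier are assumptions for illustration, not OpenAI's actual evaluation harness.

```python
# Hypothetical sketch of scoring a refusal evaluation; `model` and
# `is_refusal` are assumed callables, not OpenAI's internal tooling.
def refusal_metrics(examples, model, is_refusal):
    """examples: iterable of (prompt, should_refuse) pairs."""
    tp = fp = fn = tn = 0
    for prompt, should_refuse in examples:
        refused = is_refusal(model(prompt))
        if should_refuse and refused:
            tp += 1  # correctly refused a disallowed request
        elif should_refuse and not refused:
            fn += 1  # missed refusal: the unsafe failure mode
        elif refused:
            fp += 1  # over-refusal of a benign request
        else:
            tn += 1
    return {
        "refusal_rate_on_disallowed": tp / max(tp + fn, 1),
        "over_refusal_rate": fp / max(fp + tn, 1),
    }
```

Tracking over-refusal alongside the refusal rate matters here: the system card's goal is refusing risky queries without degrading the model's usefulness on benign ones.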
Ungrounded inference evaluations gauged GPT-4V's ability to refuse requests for unjustifiable information based on the provided text and images. Multimodal jailbreak evaluations examined if visual input manipulation could be used to bypass safety systems, converting known text jailbreaks into screenshots to test the model's susceptibility.
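To illustrate the screenshot-conversion step of that evaluation, here is a minimal sketch using Pillow. The function and layout choices are assumptions for illustration; the system card does not describe OpenAI's tooling.

```python
from PIL import Image, ImageDraw

def text_to_screenshot(prompt_text: str, path: str,
                       size: tuple = (1024, 512)) -> None:
    """Render a text prompt into an image so a vision-language model's
    handling of image-borne instructions can be evaluated."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    # Default bitmap font; a real harness would vary fonts, colors,
    # and layouts to probe robustness more thoroughly.
    draw.multiline_text((20, 20), prompt_text, fill="black")
    img.save(path)

text_to_screenshot("Example prompt rendered as an image.", "prompt.png")
```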
External Red Teaming: Scientific Proficiency and Privacy
Extensive external expert analysis was employed to inspect GPT-4V's performance in various risk areas, such as scientific proficiency, medical advice, stereotyping and ungrounded inferences, disinformation, hateful content, and visual vulnerabilities.
While GPT-4V proved capable of interpreting complex information and critically assessing novel scientific claims in some instances, its performance remained unreliable in areas requiring high accuracy. Reports of hallucinations, factual errors delivered in an authoritative tone, and misinterpretations of visual information preclude the model from being used for high-risk tasks, such as identifying dangerous compounds.
Medical professionals found inconsistencies in GPT-4V's ability to provide medical advice, with risks including inaccuracies, biases, and contextual misunderstandings. As such, the model should not be used as a substitute for professional medical guidance or judgment. Meanwhile, GPT-4V's approach to ungrounded inferences sparked discussions on the impact of unwanted and unintentional assumptions, prompting the need for additional safety measures.
Overall, GPT-4V showcases OpenAI's commitment to pushing the boundaries of AI research, exhibiting tremendous potential across various fields. Despite its achievements, the report acknowledges the limitations of deploying a multimodal language model and emphasizes the importance of user feedback, expert evaluations, and iterative deployment practices to ensure a responsible and secure system that benefits all.
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
Authors: Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia
Source & References: https://arxiv.org/abs/2309.12307
Introduction
LongLoRA is a recent study that presents a new approach to efficiently fine-tuning large language models (LLMs) to longer context sizes. Training LLMs with long contexts is typically very resource-intensive, demanding extensive training hours and GPU memory; the study's goal is to extend the context window of pre-trained LLMs without the massive computational burden usually associated with this process.
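The paper's core trick is shifted sparse attention (S2-Attn): during fine-tuning, attention is computed within short token groups, with half of the heads shifted by half a group so information still flows across group boundaries. Below is a minimal PyTorch sketch adapted from the idea in the paper; the tensor layout and helper details are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def shift_short_attention(qkv: torch.Tensor, group_size: int) -> torch.Tensor:
    """qkv: (batch, seqlen, 3, heads, head_dim); seqlen divisible by group_size."""
    b, n, _, h, d = qkv.shape
    # Shift the second half of the heads by half a group so tokens
    # near group boundaries can still attend across them.
    qkv = torch.cat(
        (qkv[:, :, :, : h // 2],
         qkv[:, :, :, h // 2:].roll(-group_size // 2, dims=1)), dim=3)
    # Fold groups into the batch dimension: attention is now local,
    # costing O(group_size^2) per group instead of O(seqlen^2).
    qkv = qkv.reshape(b * n // group_size, group_size, 3, h, d)
    q, k, v = (t.transpose(1, 2) for t in qkv.unbind(dim=2))
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    out = out.transpose(1, 2).reshape(b, n, h, d)
    # Roll the shifted heads back into place.
    return torch.cat(
        (out[:, :, : h // 2],
         out[:, :, h // 2:].roll(group_size // 2, dims=1)), dim=2)
```

Because each token attends only within its group during fine-tuning, the cost grows with the group size rather than the full context length. The paper pairs this with LoRA adapters, additionally making the embedding and normalization layers trainable, and the fine-tuned model reverts to standard full attention at inference time.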