Octopus v2, MiniGPT-4-Video, Mixture-of-Depths, LLMs as compilers & Visual Autoregressive Modeling
Week 2, April 2024
Greetings,
Welcome to the 53rd edition of the State of AI. This issue brings you to the forefront of innovation with insightful explorations into on-device language models, advanced multimodal understanding of videos, dynamic compute allocation in transformer models, the intriguing potential of language models in simulating pseudocode, and scalable image generation solutions. Each article in this edition presents cutting-edge research and potential breakthroughs that continue to push the boundaries of AI technology. Dive into a future shaped by these transformative ideas. Enjoy!
Best regards,
State of AI
Contents
Octopus v2: On-device language model for super agent
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Octopus v2: On-device language model for super agent
Authors: Wei Chen, Zhiyuan Li
Source and references: https://arxiv.org/abs/2404.01744v3
In a significant leap towards more secure and efficient use of artificial intelligence, Wei Chen and Zhiyuan Li of Stanford University have developed Octopus v2, an on-device language model that challenges the status quo of cloud-reliant AI systems. This model not only enhances the privacy and cost-efficiency of AI applications but also promises strong real-time responsiveness and accuracy, making it a formidable competitor to cloud-based models such as GPT-4.
The Drive for On-Device AI
In an era where data privacy and internet security are paramount, the reliance on cloud-based AI models presents notable risks and limitations. Every query processed over the cloud risks exposure, and each interaction incurs costs that accumulate rapidly at scale. Moreover, these systems require a stable internet connection, restricting their usability in low-connectivity areas.
Enter Octopus v2, a groundbreaking solution designed to run sophisticated AI algorithms directly on consumer devices such as smartphones, smartwatches, and even automobiles. By bringing AI processing on-device, Octopus v2 addresses the key concerns of privacy, latency, and availability while unlocking new potentials in user interaction.
Outperforming the Cloud Giants
Leveraging a compact yet powerful 2-billion-parameter model, Octopus v2 achieves what many on-device models have struggled with: matching and even surpassing the accuracy and efficiency of cloud-based counterparts such as GPT-4. By removing the need to include lengthy function descriptions in the prompt, it cuts the context length required for function calling by 95%, improving both the speed and accuracy of responses.
This leap in performance is largely due to the novel function-calling mechanism the Stanford team pioneered. Instead of processing long stretches of tokens to interpret commands, Octopus v2 uses a token-efficient approach that drastically cuts processing overhead. Compared with Llama-7B using RAG-based function calling, Octopus v2 improves latency by an impressive factor of 35.
Rethinking Function Calls
One of the cornerstones of Octopus v2's efficacy is its innovative handling of function calls, typically a resource-intensive task for language models. Traditional approaches ask the model to sift through descriptions of every candidate function for each query, a method that is both slow and prone to errors.
Octopus v2 simplifies this by introducing specialized 'functional tokens', each representing a specific function call. This method not only speeds up decision-making by shrinking the set of potential matches but also improves the model's accuracy in selecting the right function. These tokens act like shortcuts, linking user commands directly to the correct function executions without the roundabout of traditional processing.
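The paper's released implementation is not reproduced here, but a minimal Python sketch can make the idea concrete. In the snippet below, the token names (<fn_0>, <fn_1>, ...), the function registry, and the parsing helper are all hypothetical illustrations of how a single generated functional token plus its arguments might be decoded directly into a device action:

```python
import ast

# Illustrative sketch (not the authors' code): mapping specialized
# "functional tokens" in the model's output directly to device actions.
# Token names and function signatures here are hypothetical.
FUNCTION_REGISTRY = {
    "<fn_0>": ("set_alarm", ["time"]),
    "<fn_1>": ("send_text", ["recipient", "message"]),
    "<fn_2>": ("get_weather", ["city"]),
}

def decode_function_call(model_output: str) -> dict:
    """Parse a generation like '<fn_1>("Alice", "On my way")' into a call spec."""
    token, _, arg_str = model_output.strip().partition("(")
    token = token.strip()
    if token not in FUNCTION_REGISTRY:
        raise ValueError(f"Unknown functional token: {token}")
    name, param_names = FUNCTION_REGISTRY[token]
    # Parse the argument list as a Python tuple literal.
    parsed = ast.literal_eval("(" + arg_str) if arg_str else ()
    args = parsed if isinstance(parsed, tuple) else (parsed,)
    return {"function": name, "arguments": dict(zip(param_names, args))}

# The model emits one short functional token plus arguments, instead of
# reasoning over every candidate function description in the prompt.
print(decode_function_call('<fn_1>("Alice", "Running late, be there soon")'))
# {'function': 'send_text', 'arguments': {'recipient': 'Alice', 'message': 'Running late, be there soon'}}
```

Because the model only needs to emit one short token sequence per command, the prompt no longer has to carry descriptions of every available function, which is where the context-length and latency savings come from.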
Machine Learning Meets Edge Computing
The training protocol for Octopus v2 used rigorous, tailored datasets covering common user interactions with technology, such as making calls, setting reminders, or navigating apps, all expressed as calls to Android APIs. These scenarios were carefully chosen to train the model not just to understand general language but to interpret and act on the specific user intents that matter for mobile and edge-device functionality.
This bespoke training regimen ensures that Octopus v2 is not merely reactive but predictive, capable of understanding nuanced commands and executing them swiftly and accurately. This is particularly impactful in user-facing scenarios where speed and precision are critical, such as in driving or emergency communication.
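For intuition about the data itself, here is a hypothetical sketch of what such query-to-function-call training pairs might look like; the field names, the <fn_k> tokens, and the formatting helper are illustrative assumptions rather than the paper's actual dataset schema:

```python
# Hypothetical (query, target) pairs for function-calling fine-tuning.
# The <fn_k> tokens reuse the illustrative registry from the earlier sketch.
TRAINING_SAMPLES = [
    {
        "query": "Wake me up at 6:30 tomorrow morning",
        "target": '<fn_0>("06:30")',                                  # set_alarm(time)
    },
    {
        "query": "Text Alice that I'm running late",
        "target": '<fn_1>("Alice", "Running late, be there soon")',   # send_text(recipient, message)
    },
    {
        "query": "What's the weather like in Palo Alto?",
        "target": '<fn_2>("Palo Alto")',                              # get_weather(city)
    },
]

def to_training_text(sample: dict) -> str:
    """Flatten one pair into the prompt/completion string a causal LM is fine-tuned on."""
    return f"Query: {sample['query']}\nResponse: {sample['target']}"

for sample in TRAINING_SAMPLES:
    print(to_training_text(sample))
    print("---")
```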
Future Perspectives
The advent of Octopus v2 heralds a new era in AI where local processing might soon become the norm. This shift promises not only enhanced security and privacy but also greater accessibility, giving users uninterrupted access to AI capabilities regardless of their internet connectivity.
Moreover, as the world increasingly moves towards more integrated tech environments, the ability of AI models like Octopus v2 to process information on-device will be crucial in reducing latency and preserving bandwidth, ultimately leading to smarter and more responsive AI systems.
Concluding Thoughts
With Octopus v2, Wei Chen and Zhiyuan Li have not just presented an alternative to cloud-based AI—they have opened the door to possibilities that could redefine user interaction with technology. As this model moves closer to widespread adoption, it stands as a testament to the evolving relationship between machine learning, user privacy, and device capability, signaling a significant shift towards a more secure and empowered digital future.
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens
Authors: Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny
Source and references: https://arxiv.org/abs/2404.03413
Introduction
In the fast-evolving domain of artificial intelligence, the application of Large Language Models (LLMs) to understand still imagery and text has seen notable successes. However, as digital content consumption shifts increasingly towards dynamic multimedia, like videos, the challenge pivots to developing models that can interpret and interact with both visual and textual data in video format. The paper introduces MiniGPT4-Video, a revolutionary model designed to tackle this burgeoning need.