Greetings,
Welcome to the 39th edition of the State of AI. In this groundbreaking issue, we explore the cutting-edge advancements shaping the future of artificial intelligence. Discover the remarkable efficiency of TinyGPT-V, a multimodal large language model built on small backbones; dive into the innovative world of LARP for open-world gaming; and unravel the complexities of video-to-video synthesis with FlowVid. Witness the seamless integration of AI and web technology in City-on-Web, and marvel at the multilingual capabilities of AnyText in visual text generation and editing.
Each article showcases the dynamic and transformative power of AI, providing an engaging and enlightening journey through the latest developments in the field. Get ready to be inspired!
Best regards,
Contents
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
LARP: Language-Agent Role Play for Open-World Games
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web
AnyText: Multilingual Visual Text Generation and Editing
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
Authors: Zhengqing Yuan, Zhaoxu Li, Lichao Sun
Source and references: https://arxiv.org/abs/2312.16862
A New Wave in Multimodal Learning
Breakthroughs in multimodal learning have led to impressive achievements in combining visual and textual information processing. Models like GPT-4V and open-source alternatives, such as LLaVA and MiniGPT-4, have showcased ground-breaking results across various tasks, like image captioning and visual question answering. However, these models' computational efficiency remains a problem, with their large number of parameters demanding massive computational resources.
Enter TinyGPT-V, a new-wave model that marries high performance with commodity computational capacity. Unlike its resource-heavy counterparts, TinyGPT-V requires just a 24 GB GPU for training and a mere 8 GB GPU or CPU for inference. But how does TinyGPT-V achieve this computational efficiency without sacrificing performance?
Powerful Yet Efficient Model Architecture
TinyGPT-V's backbone hinges on the Phi-2 large language model (LLM) and pre-trained vision modules from BLIP-2 or CLIP. This combination results in a model with only 2.8 billion parameters capable of achieving remarkable performance across several tasks. The TinyGPT-V architecture consists of a visual encoder, linear projection layers, and a large language model.
The visual encoder remains inactive during training, while the linear projection layers connect the visual features extracted by the visual encoder to the language model. For its large language model, TinyGPT-V utilizes Phi-2, an efficient and competent model with 2.7 billion parameters that can match or even outperform models 25 times larger.
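To make the idea concrete, here is a minimal sketch of how a frozen vision encoder can be bridged to a compact LLM through trainable linear projection layers; the module names and dimensions below are illustrative and not taken from the paper's code.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Maps frozen visual features into the LLM's embedding space."""
    def __init__(self, vision_dim: int = 1408, llm_dim: int = 2560):
        super().__init__()
        # Trainable projection layers bridge patch features to the LLM hidden size.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_dim) from the frozen encoder
        return self.proj(visual_features)

# The visual encoder stays frozen; only the projection layers (and, later,
# LoRA weights inside the LLM) receive gradients.
vision_encoder = nn.Identity()   # stand-in for a BLIP-2/CLIP vision backbone
for p in vision_encoder.parameters():
    p.requires_grad = False

projector = VisionToLLMProjector()
image_tokens = projector(torch.randn(1, 32, 1408))  # ready to prepend to text embeddings
```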
To address challenges in small LLM transfer learning and eliminate gradient vanishing, TinyGPT-V incorporates normalization techniques. LayerNorm and RMSNorm are applied to stabilize the data, while Query-Key Normalization is included for low-resource learning scenarios.
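The query-key normalization trick is easy to express in code. Below is a hedged sketch (not the paper's exact implementation) of a single attention head that applies LayerNorm to queries and keys before the dot product, which keeps attention logits in a stable range during low-resource training.

```python
import math
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    """Single-head attention with LayerNorm applied to queries and keys."""
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.q_norm = nn.LayerNorm(dim)
        self.k_norm = nn.LayerNorm(dim)
        self.scale = 1.0 / math.sqrt(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalizing Q and K bounds the dot-product magnitudes, which helps
        # avoid vanishing/exploding gradients in small models.
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```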
A Four-Stage Training Process
TinyGPT-V follows a four-stage training process: warm-up training, pre-training, human-like learning, and multi-task learning. During warm-up training, the model learns vision-language understanding using a large library of aligned image-text pairs. In the pre-training stage, the model utilizes the same dataset to fine-tune the LoRA module, an efficient fine-tuning method that does not increase inference time.
In the human-like learning stage, the model is fine-tuned with image-text pairings from MiniGPT-4 and LLaVA, while the fourth stage focuses on enhancing its conversational ability as a chatbot by tuning the model on various multimodal instruction datasets, such as LLaVA, Flickr30k, and a mixed multi-task dataset.
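Since the second stage fine-tunes a LoRA module, a minimal sketch of the LoRA idea may help: the base weights stay frozen and only a low-rank update is trained. The rank and scaling values below are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)    # start as a no-op update
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

# After training, the low-rank update can be merged into the base weight,
# so inference cost is identical to the original layer.
layer = LoRALinear(nn.Linear(2560, 2560))
```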
Multi-task Instruction Templates
To mitigate potential ambiguity when handling distinct tasks, such as visual question answering and image captioning, the model uses task-specific instruction tokens from MiniGPT-v2 within a multi-task instruction template. This template disambiguates tasks and allows TinyGPT-V to execute them with greater precision and accuracy.
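As an illustration of how such a template might be assembled (the token names and prompt format below are placeholders, not the exact MiniGPT-v2 strings):

```python
# Illustrative only: the actual task-identifier tokens come from MiniGPT-v2's
# template; the names below are placeholders for the idea.
TASK_TOKENS = {
    "vqa": "[vqa]",
    "caption": "[caption]",
    "grounding": "[grounding]",
}

def build_instruction(task: str, question: str = "") -> str:
    """Prefix the user instruction with a task identifier so the model
    knows which behavior is being requested."""
    token = TASK_TOKENS[task]
    return f"[INST] <Img><ImageHere></Img> {token} {question} [/INST]"

print(build_instruction("vqa", "What color is the car?"))
```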
A Leap Forward in Multimodal Large Language Models
TinyGPT-V represents a substantial step in achieving the balance between high performance and computational efficiency in multimodal large language models. By leveraging the Phi-2 model, TinyGPT-V proves that smaller, more efficient models can achieve results comparable to much larger counterparts. Thus, the TinyGPT-V model fosters further developments for designing cost-effective, efficient, and high-performing MLLMs, expanding their applicability in a broad array of real-world scenarios.
LARP: Language-Agent Role Play for Open-World Games
Authors: Ming Yan, Ruihao Li, Hao Zhang, Hao Wang, Zhilan Yang, and Ji Yan
Source and references: https://arxiv.org/abs/2312.17653
Introduction
In recent years, large language models (LLMs) have gained significant traction in various fields, from translation to question answering. The development of these LLMs has inevitably intertwined with another growing technology: agents (also known as AI entities). The natural next step is to create language agents, systems that incorporate both LLMs and agent architectures.
One particularly ripe area for exploration is the application of language agents in open-world games. These immersive environments demand the use of agents to play non-player characters (NPCs) with diversified behaviors, adapting to complex game scenarios and maintaining a consistent narrative experience.
As a step towards integrating language agents with open-world gaming, the authors present the Language-Agent Role Play (LARP) framework. LARP is designed to smoothly blend open-world games with language agents through a modular approach covering memory processing, decision-making, and continuous learning from interactions.
Cognitive Architecture
The cognitive architecture of LARP consists of four major modules: long-term memory, working memory, memory processing, and decision-making. Long-term memory serves as the storage for substantial knowledge, while working memory holds temporary caches for currently relevant information.
External databases are used for semantic memory, while episodic memory is stored in vector databases. Procedural memory is implemented in action spaces, where actions can be learned and extended.
The memory processing module takes in perceived input information and transforms it into content for long-term memory, enabling the recall and reconstruction of stored memories. It incorporates a combination of natural language transformation rules, predicate logic, and vector similarity searches. In addition, decay parameters ensure that older memories are gradually forgotten over time.
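A toy sketch of this memory-processing idea, combining vector-similarity recall with a time-based decay factor; the class and parameters are hypothetical and only illustrate the mechanism described above.

```python
import math
import time
import numpy as np

class EpisodicMemory:
    """Toy episodic store: vector-similarity recall weighted by time decay."""
    def __init__(self, decay_rate: float = 1e-5):
        self.embeddings: list[np.ndarray] = []
        self.contents: list[str] = []
        self.timestamps: list[float] = []
        self.decay_rate = decay_rate

    def store(self, embedding: np.ndarray, content: str) -> None:
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        self.contents.append(content)
        self.timestamps.append(time.time())

    def recall(self, query: np.ndarray, k: int = 3) -> list[str]:
        query = query / np.linalg.norm(query)
        now = time.time()
        scores = []
        for emb, ts in zip(self.embeddings, self.timestamps):
            similarity = float(emb @ query)
            recency = math.exp(-self.decay_rate * (now - ts))  # older memories fade
            scores.append(similarity * recency)
        top = np.argsort(scores)[::-1][:k]
        return [self.contents[i] for i in top]
```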
Decision-making in LARP revolves around a series of programmable units that process content in working memory and game context. These units can be simple information processing tasks or more complex trained models, with the ability to update working memory in real-time.
Environment Interaction
For language agents to interact with an open-world game, they must perform two tasks: generate actions based on current observations and update the game state according to those actions. LARP addresses both requirements by first constructing tasks or dialogues for NPCs within the agents and then executing them in the game world.
Tasks are generated using the information obtained from both working and long-term memory and decision-making modules. The actions to be executed in the game are determined in the action space of LARP. This space is divided into two types of APIs: public and personal. Public APIs represent general actions available to all agents, while personal APIs can vary depending on the agent's abilities and may expand as new actions are learned.
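A hypothetical sketch of this two-tier action space, with shared public actions and per-agent personal actions that can be learned at runtime; all names here are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Public actions are available to every agent.
PUBLIC_ACTIONS: Dict[str, Callable[..., str]] = {
    "move_to": lambda target: f"moving to {target}",
    "say": lambda text: f"saying: {text}",
}

@dataclass
class AgentActionSpace:
    """Shared public actions plus personal, learnable actions for one agent."""
    personal_actions: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def learn_action(self, name: str, fn: Callable[..., str]) -> None:
        # Personal APIs can grow as the agent acquires new skills.
        self.personal_actions[name] = fn

    def execute(self, name: str, *args) -> str:
        action = self.personal_actions.get(name) or PUBLIC_ACTIONS.get(name)
        if action is None:
            raise ValueError(f"Agent cannot perform '{name}'")
        return action(*args)

blacksmith = AgentActionSpace()
blacksmith.learn_action("forge_sword", lambda metal: f"forging a {metal} sword")
print(blacksmith.execute("forge_sword", "iron"))
```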
To maintain an engaging gaming experience, LARP incorporates reinforcement learning techniques that promote continuous learning from interactions. Agents receive feedback from the environment and update their action preferences accordingly, allowing them to adapt their responses to user input and other in-game events.
Aligning Personalities
Diverse and consistent character personalities are a vital aspect of open-world games. To ensure NPCs stay true to their predefined backgrounds and behavioral tendencies, LARP introduces a postprocessing module that aligns generated actions and dialogues with specific personality types.
This module acts as a filter, verifying the coherence of an agent's output and potentially reprocessing it when necessary. In doing so, the framework ensures agent actions are consistent with their character and improve player immersion in the game world.
Conclusion
The LARP framework offers a comprehensive solution for integrating language agents with open-world games. Combining cognitive architectures, environment interaction modules, and personality alignment mechanisms, LARP enhances the gaming experience while showcasing the potential of LLMs in various application scenarios, such as entertainment, education, and simulation.
The modular design of this architecture encourages continued development, improvements, and extensions to meet the evolving demands of open-world gaming. Future research can build upon this foundation to explore even more immersive and engaging language-agent-driven experiences.
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis
Authors: Feng Liang, Bichen Wu, Jialiang Wang, Licheng Yu, Kunpeng Li, Yinan Zhao, Ishan Misra, Jia-Bin Huang, Peizhao Zhang, Peter Vajda, Diana Marculescu
Source and references: https://arxiv.org/abs/2312.17681
Bridging the Gap in Video Synthesis
Traditional image-to-image (I2I) synthesis has progressed significantly, but video-to-video (V2V) synthesis still faces challenges, particularly in maintaining temporal consistency across frames. FlowVid tackles this issue by taking advantage of both spatial conditions and temporal optical flow information within a source video. With a flexible, efficient, and high-quality approach, FlowVid aims to improve V2V synthesis while accounting for the imperfections in optical flow estimation.
The FlowVid Framework
FlowVid is designed to support several applications, such as global stylization (e.g., converting video to 2D anime), object swaps (e.g., swapping a panda for a koala), and local edits (e.g., adding a pig nose to a panda). In the process, the framework uses optical flow to maintain temporal consistency while handling the imperfections inherent in flow estimation.
The authors extended existing image U-Net architectures to accommodate videos, incorporating spatial-temporal attention alongside spatial conditions like depth maps. FlowVid trains a video diffusion model to predict the input video from these conditions; at generation time, this lets edits made to the first frame propagate throughout the video. With an autoregressive mechanism, FlowVid can generate lengthy videos in a fraction of the time required by current methods.
Optical Flow: A Soft Condition
FlowVid introduces a soft optical flow condition to efficiently handle potential inaccuracies in flow estimation. Given a sequence of frames, the framework calculates the optical flow between the first frame and other frames, using a pretrained flow estimation model (UniMatch). It then performs a forward-backward consistency check to create forward and backward occlusion masks.
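The forward-backward consistency check is a standard technique; the sketch below illustrates it with flow warping via grid sampling, using common threshold values rather than FlowVid's exact settings.

```python
import torch
import torch.nn.functional as F

def warp(tensor: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `tensor` (B, C, H, W) with `flow` (B, 2, H, W) in pixels."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow)      # (2, H, W)
    coords = grid.unsqueeze(0) + flow                         # sampling positions
    # Normalize coordinates to [-1, 1] for grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)     # (B, H, W, 2)
    return F.grid_sample(tensor, grid_norm, align_corners=True)

def occlusion_mask(flow_fwd: torch.Tensor, flow_bwd: torch.Tensor,
                   alpha: float = 0.01, beta: float = 0.5) -> torch.Tensor:
    """Mark pixels occluded (mask = 0) when the forward-backward round trip
    does not return close to the starting point."""
    flow_bwd_warped = warp(flow_bwd, flow_fwd)
    diff = (flow_fwd + flow_bwd_warped).pow(2).sum(dim=1)
    mag = flow_fwd.pow(2).sum(dim=1) + flow_bwd_warped.pow(2).sum(dim=1)
    return (diff < alpha * mag + beta).float()                # (B, H, W)
```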
By incorporating both spatial controls (like depth maps) and temporal flow conditions as part of its training process, FlowVid can generate more consistent and accurate videos. The resulting model flexibly adapts to different types of modifications, including stylization, object swaps, and local edits.
Edit-Propagate Design for Generation
To leverage existing I2I models effectively, FlowVid employs an "edit-propagate" method for video synthesis. First, the model edits the first frame using I2I models, obtaining an edited first frame. It then propagates the edits to subsequent frames through flow warping and occlusion masks derived from the input video. The spatial conditions from the input video help guide the structural layout of the synthesized video.
This decoupled design allows FlowVid to generate lengthy videos quickly while maintaining high quality. Compared with current V2V methods, FlowVid is significantly faster and achieves higher user preference rates.
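At a high level, the edit-propagate loop might look like the following sketch; every function passed in is a placeholder standing in for a real component (an I2I editor, a flow estimator, the video diffusion model), not FlowVid's actual API.

```python
def edit_propagate(frames, prompt, i2i_edit, estimate_flow, flow_warp, video_diffusion):
    """Sketch of the edit-propagate idea; all callables are placeholders."""
    # 1. Edit the first frame with any off-the-shelf image-to-image model.
    edited_first = i2i_edit(frames[0], prompt)

    # 2. Propagate that edit to the rest of the clip via flow warping,
    #    masking out occluded pixels so imperfect flow does not pollute
    #    the condition.
    warped_conditions = [edited_first]
    for frame in frames[1:]:
        flow, occlusion = estimate_flow(frames[0], frame)
        warped_conditions.append(flow_warp(edited_first, flow, occlusion))

    # 3. The video diffusion model treats the warped frames as a *soft*
    #    temporal condition (alongside spatial conditions such as depth),
    #    so it can correct regions where the flow was wrong.
    return video_diffusion(frames, warped_conditions, prompt)
```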
A Powerful and Efficient Approach to Video Synthesis
FlowVid demonstrates a robust and efficient approach to V2V synthesis. Its ability to harness the benefits of optical flow while handling its imperfections sets it apart from existing methods. The decoupled edit-propagate design supports multiple applications and enables the generation of lengthy videos using autoregressive evaluation.
The overall efficiency and quality of FlowVid make it an appealing choice for video synthesis applications. It can generate a 4-second, 30 FPS video at 512x512 resolution in just 1.5 minutes, outperforming current state-of-the-art methods like CoDeF, Rerender, and TokenFlow. In user studies, FlowVid was preferred 45.7% of the time, underscoring its high generation quality.
Conclusion
FlowVid represents a major leap forward in V2V synthesis, enabling various forms of video editing while maintaining high quality and efficiency. By incorporating both spatial and temporal conditions and leveraging the power of diffusion models, FlowVid can generate consistent, tailored videos for different applications.
The promising results of this research open up new possibilities for text-guided video synthesis and the broader film industry. As FlowVid continues to develop and improve, it's likely that this technology will play a crucial role in shaping the future of video editing and synthesis.
City-on-Web: Real-time Neural Rendering of Large-scale Scenes on the Web
Authors: Kaiwen Song, Juyong Zhang
Source and references: https://arxiv.org/abs/2312.16457
Introduction
3D scene reconstruction has come a long way, thanks to emerging techniques like Neural Radiance Fields (NeRF). Researchers have managed to create high-quality real-time rendering of small scenes on various devices, including laptops and smartphones. However, this success isn't easily replicable in large-scale scenes because of the constraints in computational power, memory, and bandwidth.
This paper introduces a new technique called City-on-Web, which solves these issues by dividing the scene into manageable blocks with varying levels of detail (LoD). The authors ensure seamless real-time rendering with high fidelity, efficient memory management, and fast processing for large-scale scenes.
This solution paves the way for photo-realistic large-scale scene rendering on web platforms, achieving 32 frames per second at 1080p resolution with limited GPU resources. The City-on-Web approach preserves reconstruction quality, making it a groundbreaking innovation in this field.
Large-scale Scene Reconstruction and Real-time Rendering
Various techniques have been developed for large-scale scene reconstruction, such as Block-NeRF, Mega-NeRF, Switch-NeRF, Grid-NeRF, and NeRF++. These methods segment large scenes into smaller blocks and focus on improving the model's representational capacity to capture details in large scenes.
Real-time rendering focuses on the speed of rendering and is essential for a seamless user experience. Innovative solutions like NSVF, KiloNeRF, SNeRG, Termi-NeRF, DONeRF, and MobileNeRF have emerged to address this challenge.
Level of Detail (LoD) techniques further optimize the rendering process and have been applied to neural implicit reconstruction recently. NGLoD, BungeeNeRF, TrimipRF, and LoD-Neus use multi-scale representation to capture details at different levels, enhancing reconstruction quality.
City-on-Web: Large-scale Radiance Field with LoD
The authors introduce a City-on-Web method that divides large scenes into blocks and hierarchically partitions them with varying LoD. Each block has a low-resolution voxel and high-resolution triplane representation, enabling dynamic resource management and efficient rendering.
City-on-Web applies a scene contraction function to peripheral blocks, accounting for data at the boundaries. By integrating these concepts, the authors successfully render large-scale scenes on the web with minimal memory footprint while maintaining high fidelity and performance.
Consistent Training and Rendering
To ensure a high-quality rendered output, it's important to maintain consistency between the training and rendering stages. City-on-Web uses multiple shaders to render different scene blocks. It then combines these blocks using volume rendering weights and opacity, ensuring a seamless output and 3D consistency at inter-block boundaries.
This approach not only simulates the rendering process on the web but also maintains the same reconstruction quality as traditional methods.
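A simplified illustration of that compositing step: per-block colors and opacities along a ray are merged front to back with standard volume-rendering weights, so later blocks are attenuated by the transmittance accumulated in earlier ones. The real system does this inside web shaders; the Python below is only a sketch of the math.

```python
import numpy as np

def composite_blocks(block_outputs):
    """Composite per-block renderings along a ray in front-to-back order.

    Each entry is (color, alpha) for the segment of the ray inside one block,
    already sorted by distance from the camera. Keeping the accumulated
    transmittance consistent across blocks is what avoids seams at
    inter-block boundaries.
    """
    color = np.zeros(3)
    transmittance = 1.0
    for block_color, block_alpha in block_outputs:
        color += transmittance * block_alpha * np.asarray(block_color)
        transmittance *= (1.0 - block_alpha)
    return color, 1.0 - transmittance   # final color and accumulated opacity

# Example: a near block contributes most of the color, a far block a little.
print(composite_blocks([((1.0, 0.2, 0.2), 0.7), ((0.2, 0.2, 1.0), 0.9)]))
```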
Optimization Strategies and LoD Generation
City-on-Web incorporates optimization strategies like dynamic resource loading, real-time ray tracing, and data transmission. By using LoD techniques, the authors minimize the load of distant resources while focusing on near-surface details.
LoD generation is achieved by generating different levels of spatial partitioning based on the reconstruction results, allowing for memory-efficient and high-quality rendering.
Baking the Model for Real-time Rendering
For real-time rendering on the web, City-on-Web employs a baking process to store the scene data in 3D atlas textures for each block. It then uses shaders to render each block based on the viewer's position in real-time, enabling fast rendering with limited resources.
This method performs efficiently, even on browsers with limited resources. City-on-Web outperforms mesh-based methods, with much lower VRAM usage and payload size.
Conclusion
City-on-Web is a groundbreaking technique for real-time neural rendering of large-scale scenes on the web. By dividing scenes into manageable blocks with varying LoD, it achieves high fidelity rendering, efficient memory management, and fast processing with limited resources.
City-on-Web's consistent training and rendering approach maintains reconstruction quality, outperforming existing mesh-based methods. With the ability to render photo-realistic large-scale scenes on web platforms, City-on-Web sets a new benchmark for real-time rendering.
AnyText: Multilingual Visual Text Generation and Editing
Authors: Yuxiang Tuo, Wangmeng Xiang, Jun-Yan He, Yifeng Geng, Xuansong Xie
Source and references: https://arxiv.org/abs/2311.03054
Introduction
One key challenge in the field of image synthesis is generating high-fidelity images with legible and accurate text. Many generative diffusion-based models have been introduced recently and have achieved impressive results in image quality and versatility, yet they still struggle to produce clear, readable visual text. The authors take on this challenge by introducing AnyText, a diffusion-based multilingual visual text generation and editing model.
AnyText Framework
The main components of the AnyText framework are the text-control diffusion pipeline, the auxiliary latent module, and the text embedding module. They work together to accurately generate and edit coherent text in images. The text-control diffusion pipeline helps to render accurate text by utilizing the auxiliary latent module's information, which includes text glyph, position, and masked images. The text embedding module combines stroke data from an Optical Character Recognition (OCR) model with image captions to generate background-integrated texts.
The authors took a unique approach to make AnyText stand out from its competitors, focusing on the following five functionalities:
Multi-line: AnyText can render texts on multiple lines at user-specified positions.
Deformed regions: The model allows writing text in horizontal, vertical, curved, or irregular regions.
Multi-lingual: AnyText can generate text in various languages, such as Chinese, English, Japanese, and Korean.
Text editing: The model can modify text content within an input image while maintaining consistency with the surrounding text style.
Plug-and-play: AnyText can be easily integrated with existing diffusion models for rendering or editing text accurately.
The Power of the Text-Control Diffusion Pipeline
In the text-control diffusion pipeline, the authors use three types of auxiliary conditions (glyph, position, and masked image) to produce a latent feature map that effectively controls the generation of visual text. This feature map is fed into TextControlNet, a trainable network that predicts the noise added to noisy latent images. By adding this network, the pipeline can focus on generating text while preserving the base model's ability to generate images without text.
Auxiliary Latent Module
To generate accurate texts in images, AnyText incorporates auxiliary information, like text glyph, position, and masked image.
Text glyph is created by rendering text in a designated position with a uniform font style. Position information is added to precisely locate the region of the text in an image. The masked image aids in deciding which region of the image should be preserved during the diffusion process.
The authors combine these image-based conditions using convolutional fusion layers, resulting in a generated feature map.
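A hedged sketch of such a fusion module, with illustrative channel counts and layer shapes rather than AnyText's exact architecture: the glyph and masked image are treated as RGB renders, the position map as a single channel, and simple convolutions merge them into one conditioning feature map.

```python
import torch
import torch.nn as nn

class AuxiliaryLatentFusion(nn.Module):
    """Fuses glyph, position, and masked-image conditions into one feature map."""
    def __init__(self, out_channels: int = 4):
        super().__init__()
        self.glyph_conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.position_conv = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)
        self.masked_conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)
        self.fuse = nn.Conv2d(24, out_channels, kernel_size=3, padding=1)

    def forward(self, glyph, position, masked_image):
        # Encode each condition separately, then fuse the concatenated features.
        feats = torch.cat(
            [self.glyph_conv(glyph),
             self.position_conv(position),
             self.masked_conv(masked_image)],
            dim=1,
        )
        return self.fuse(feats)

fusion = AuxiliaryLatentFusion()
z_aux = fusion(torch.randn(1, 3, 512, 512),
               torch.randn(1, 1, 512, 512),
               torch.randn(1, 3, 512, 512))   # (1, 4, 256, 256) conditioning map
```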
Text Embedding Module
AnyText adopts a novel approach for encoding both text glyph and caption semantic information. By rendering glyph lines into images and encoding glyph information, the authors bypass the traditional language-specific text encoders, making it possible to generate text seamlessly in multiple languages.
They utilize the PP-OCRv3 model to extract features from images and replace caption tokens with text line embeddings. This approach enables better integration of generated text with the background and allows for multi-language text generation.
Learning from Text Perceptual Loss
The authors introduced a text perceptual loss function to improve the accuracy of text generation. This loss function uses an OCR recognition model to extract image features before the last fully connected layer and enhances text generation at a pixel-wise level. By harnessing the position information, the text perceptual loss precisely targets the text area, comparing it with the corresponding area in the original image.
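Conceptually, the loss could be sketched as follows, where ocr_features stands in for a frozen OCR recognition backbone and the position mask restricts the comparison to the text region; this is an illustrative sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def text_perceptual_loss(pred_image, target_image, text_mask, ocr_features):
    """Compare OCR features of the predicted vs. original text regions.

    `ocr_features` is assumed to return features from before the OCR model's
    final fully connected layer; the mask zeroes out everything outside the
    text area so only the rendered text is penalized.
    """
    pred_crop = pred_image * text_mask
    target_crop = target_image * text_mask
    with torch.no_grad():
        target_feat = ocr_features(target_crop)   # OCR backbone stays frozen
    pred_feat = ocr_features(pred_crop)
    return F.mse_loss(pred_feat, target_feat)
```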
Results and Contributions
AnyText has outperformed other approaches in extensive evaluation experiments, showcasing its ability to generate high-quality, legible text in multiple languages. The authors also contribute AnyWord-3M, the first large-scale multilingual text-image dataset, containing 3 million image-text pairs with OCR annotations in multiple languages, and propose AnyText-benchmark for evaluating the accuracy and quality of visual text generation.
The project's source code will soon be open-sourced on GitHub to improve and promote the development of text generation technology.
Conclusion
AnyText, the multilingual visual text generation and editing model, successfully addresses the challenge of incorporating accurate and coherent text in the background of generated images. By blending the strengths of the text-control diffusion pipeline, auxiliary latent module, and text embedding module, AnyText achieves impressive results in generating readable text in multiple languages and curved or irregular regions. This powerful plug-and-play model can be integrated seamlessly into existing diffusion models for rendering and editing text, making it a valuable resource in the field of image synthesis.