Greetings,
Welcome to the eighth edition of State of AI. This week, we delve into the versatility of AI across diverse fields. We start with an exploration of rapid language learning by neural networks, where distilling Bayesian priors into the networks offers a new way to mimic human-like language acquisition.
We then look at the expansive role of AI in biomedicine with a unified generative pre-trained transformer for multimodal tasks. This tool has the potential to revolutionize clinical decisions, diagnostics, and patient care.
A pivot towards database management unveils an improved language model adept at converting natural language into SQL, making data querying more intuitive. We also look at work that challenges traditional norms in AI by studying how transformers can operate directly on file bytes.
Finally, we ponder 'Thought Cloning', a concept that pushes AI closer to mimicking human thought processes. Join us on this intriguing journey into the ever-expanding realms of AI.
Best regards,
Contents
Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Bytes Are All You Need: Transformers Operating Directly On File Bytes
SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL
Modeling rapid language learning by distilling Bayesian priors into artificial neural networks
BiomedGPT: A Unified and Generalist Biomedical Generative Pre-trained Transformer for Vision, Language, and Multimodal Tasks
Thought Cloning: Learning to Think while Acting by Imitating Human Thinking
Authors: Shengran Hu, Jeff Clune
Source & References: https://arxiv.org/abs/2306.00323v1
Introduction
Thought Cloning is a new Imitation Learning framework that aims to improve Reinforcement Learning (RL) agents by teaching them to think like humans. The authors hypothesize that RL agents suffer cognitive deficiencies because, unlike humans, they lack the benefits of thinking in language. The paper introduces a novel approach that clones not just the behaviors of human demonstrators, but also the thoughts humans have as they perform those behaviors.
Thought Data – An Untapped Resource
While there is an abundance of data available on the internet showing humans thinking out loud while performing tasks, this thought data is often neglected. YouTube videos and transcripts, for instance, contain millions of hours of people talking out loud while performing tasks – revealing their thinking, planning, decisions, and replanning (e.g., while playing video games). The authors propose to use this untapped thought data to teach thinking skills to AI agents by imitating human thinking.
Thought Cloning Framework
The Thought Cloning framework comprises an Upper-level Component responsible for thought generation and a Lower-level Component that executes actions based on the thoughts generated by the Upper-level Component. The agents are trained on synchronized datasets of humans thinking while acting, learning to produce a natural-language thought at each time step and to condition their actions on these generated thoughts.
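To make the two-level control flow concrete, here is a minimal sketch in PyTorch of how such an agent might take a step. The class and method names (ThoughtCloningAgent, upper.generate) are illustrative, not taken from the paper's code:

```python
import torch.nn as nn

class ThoughtCloningAgent(nn.Module):
    """Minimal sketch of the bi-level agent; component names are hypothetical."""
    def __init__(self, upper, lower):
        super().__init__()
        self.upper = upper   # Upper-level Component: generates a thought
        self.lower = lower   # Lower-level Component: acts on that thought

    def step(self, obs, mission):
        # The upper level produces a natural-language thought...
        thought = self.upper.generate(obs, mission)   # e.g. "go to the red door"
        # ...and the lower level conditions its action on that thought.
        action_logits = self.lower(obs, mission, thought)
        return thought, action_logits
```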
The framework is realized as a bi-level architecture built on memory-augmented vision-language models (VLMs), using an LSTM and FiLM layers for modality fusion, effectively combining visual and text inputs.
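For readers unfamiliar with FiLM (feature-wise linear modulation, Perez et al., 2018), a short sketch of the fusion idea follows. The layer sizes are illustrative, but the gamma/beta modulation is the standard FiLM mechanism:

```python
import torch.nn as nn

class FiLM(nn.Module):
    """A text embedding produces per-channel scale (gamma) and shift (beta)
    terms that modulate visual feature maps."""
    def __init__(self, text_dim, n_channels):
        super().__init__()
        self.to_gamma = nn.Linear(text_dim, n_channels)
        self.to_beta = nn.Linear(text_dim, n_channels)

    def forward(self, visual_feats, text_emb):
        # visual_feats: (B, C, H, W); text_emb: (B, text_dim)
        gamma = self.to_gamma(text_emb)[:, :, None, None]  # (B, C, 1, 1)
        beta = self.to_beta(text_emb)[:, :, None, None]
        return gamma * visual_feats + beta
```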
To validate Thought Cloning, the authors construct a synthetic thought dataset as a stand-in for internet-scale data. They demonstrate their approach on BabyAI, a simulated 2D gridworld domain that poses several challenges: partial observability, hard-to-explore mazes, complex natural-language missions, and long-horizon planning.
Comparing Thought Cloning to Behavioral Cloning
The paper compares the Thought Cloning (TC) approach to the conventional Imitation Learning algorithm, Behavioral Cloning (BC). Unlike TC, BC does not model thoughts and is trained only with an action loss. The authors also present an ablation variant, TC w/o Imitating Thought, which shares the same architecture as TC but is trained without the thought-cloning loss.
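The distinction between the two objectives is easy to state in code. A sketch follows, with alpha as a hypothetical weighting coefficient rather than a value from the paper:

```python
import torch.nn.functional as F

def imitation_loss(action_logits, actions,
                   thought_logits=None, thoughts=None, alpha=1.0):
    """BC minimizes only the action loss; TC adds a thought-imitation term."""
    loss = F.cross_entropy(action_logits, actions)     # BC stops here
    if thought_logits is not None:                     # TC adds this term
        loss = loss + alpha * F.cross_entropy(thought_logits, thoughts)
    return loss
```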
The results show that TC learns much faster than BC and outperforms it by the end of training, suggesting that natural language can indeed help agents learn to explore and plan better. TC also beats the ablation variant, indicating that the thought-cloning loss is essential to its success.
Generalizing to Novel Situations
A crucial aspect of the Thought Cloning framework is its ability to generalize and handle novel situations. The authors test TC, BC, and the ablation variant on out-of-distribution tasks in both zero-shot and fine-tuning settings. The results indicate that TC indeed generalizes better than the other approaches, showcasing its efficacy in out-of-distribution scenarios.
Safety, Interpretability, and Debugging
Apart from improved performance, Thought Cloning offers benefits for AI safety, interpretability, and debugging. Because the agent's thoughts are expressed in language, they can be inspected directly, which helps diagnose issues, steer the agent, and halt unsafe actions before they are executed. This transparency, in turn, makes it easier to train more intelligent and safer AI agents.
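As a rough illustration of how thought-level interpretability can double as a safety hook, one could screen each generated thought before the corresponding action executes. The blocklist and names below are purely illustrative (reusing the hypothetical agent.step interface from the earlier sketch):

```python
UNSAFE_PATTERNS = ("break", "attack", "delete")  # hypothetical blocklist

def safe_step(agent, obs, mission):
    """Intercept unsafe plans: the agent declares its intent in language
    before acting, so a monitor can halt it before any action runs."""
    thought, action_logits = agent.step(obs, mission)
    if any(p in thought.lower() for p in UNSAFE_PATTERNS):
        raise RuntimeError(f"halted on unsafe thought: {thought!r}")
    return action_logits
```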
Conclusion
Thought Cloning presents a promising approach to training AI agents that think in language by imitating human thinking. The framework outperforms conventional imitation learning and generalizes better to novel situations, while also enhancing safety and interpretability. Although validated here with synthetic thought data, the real power of Thought Cloning is expected to be unlocked when it is trained on large-scale internet datasets of humans thinking out loud while acting. With further research, it has the potential to yield more powerful and safer AI agents.
Bytes Are All You Need: Transformers Operating Directly On File Bytes
Authors: Maxwell Horton, Sachin Mehta, Ali Farhadi, Mohammad Rastegari
Source & References: https://arxiv.org/abs/2306.00238v1
Introduction
In this fascinating new paper from a team at Apple, researchers propose ByteFormer, a model that performs inference directly on file bytes, with no decoding or modality-specific preprocessing. This is a big leap from the common practice of first decoding file bytes into modality-specific representations, such as RGB tensors for images or MFCCs for audio, before passing them into a neural network. The method has the potential to change how we deploy deep learning models, with applications in privacy-preserving inference.
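To illustrate the core idea, here is a minimal sketch: raw bytes become embedding indices, and a plain Transformer encoder classifies the sequence. This is not Apple's actual architecture (the paper also addresses handling long byte sequences efficiently), and all hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

class TinyByteFormer(nn.Module):
    """Sketch of byte-level inference: embed each raw byte value (0-255)
    as a token and classify with a vanilla Transformer encoder."""
    def __init__(self, n_classes, dim=192, depth=4, heads=4, max_len=4096):
        super().__init__()
        self.byte_emb = nn.Embedding(256, dim)   # one token per byte value
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, file_bytes):
        # file_bytes: (B, L) long tensor of raw byte values, no decoding
        x = self.byte_emb(file_bytes) + self.pos_emb[:, :file_bytes.size(1)]
        x = self.encoder(x)
        return self.head(x.mean(dim=1))          # mean-pool, then classify
```

The appeal is that the same model interface works whatever the file encodes, since the network never sees a decoded image or waveform.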