Dear subscriber,
We're back with another exciting special edition of the State of AI newsletter, marking an important milestone in our journey: 5000 subscribers! In this special issue, we're once again turning the spotlight on five more transformative AI research papers that have significantly shaped the world of artificial intelligence and machine learning.
These illustrious pieces of research serve as the backbone of our field, contributing to the advancements that we delve into in our regular editions. Today, we retrace our steps to understand how the groundbreaking "Gradient-Based Learning Applied to Document Recognition" helped establish convolutional networks for pattern recognition, and how the innovative "Deep Residual Learning for Image Recognition" opened the door to training far deeper networks than previously possible.
Continuing the theme of our last special edition, our goal is to take you on a journey through AI's rich history, underlining the pivotal moments and key ideas that have defined this dynamic field. Whether you're a veteran in AI or relatively new to the community, these seminal papers offer valuable insights that deepen our appreciation of AI's complex narrative.
We are thrilled to have you with us in this enlightening journey. Enjoy this exclusive edition as we celebrate our growing community and continued exploration into the fascinating world of AI.
Best regards,
Contents
Gradient-Based Learning Applied to Document Recognition
A Fast Learning Algorithm for Deep Belief Nets
Deep Residual Learning for Image Recognition (ResNet)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Mastering the Game of Go with Deep Neural Networks and Tree Search (AlphaGo)
Gradient-Based Learning Applied to Document Recognition
Authors: Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner
Source & References: https://www.researchgate.net/publication/2985446_Gradient-Based_Learning_Applied_to_Document_Recognition
Published: November 1998
Introduction
This research paper, written by a group of prominent machine learning researchers, focuses on the application of gradient-based learning techniques to the problem of document recognition, specifically handwritten character recognition. The authors demonstrate that using specialized neural network architectures, such as convolutional neural networks (CNNs), in conjunction with gradient-based learning methods, outperforms other techniques in the field. The paper also introduces a new learning paradigm called Graph Transformer Networks (GTNs), which allows for global training of multimodule systems to minimize overall performance measures.
Gradient-Based Learning: Overview and Importance
Gradient-based learning has become one of the most successful approaches in machine learning, particularly for training neural networks. It's based on optimizing a continuous, smooth loss function using gradient descent or similar algorithms. The authors argue that relying more on automatic learning methods and less on hand-designed heuristics, combined with recent advancements in computing power and the availability of large datasets, has led to significant progress in pattern recognition tasks, such as speech and handwriting recognition.
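To make the idea concrete, here is a minimal sketch (our own illustration, not code from the paper) of gradient descent on a smooth loss: fitting a linear model to synthetic data with NumPy. The data, step size, and iteration count are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for step in range(200):
    residual = X @ w - y               # prediction error
    grad = X.T @ residual / len(y)     # gradient of the mean squared error
    w -= lr * grad                     # gradient descent update

print(w)  # should approach w_true
```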
Convolutional Neural Networks: Incorporating Prior Knowledge
The authors emphasize the need for incorporating prior knowledge about the task into learning algorithms. Specifically, they introduce Convolutional Neural Networks (CNNs) that are designed to handle the variability of two-dimensional (2-D) shapes. By incorporating knowledge about invariances of 2-D shapes through local connection patterns and imposing constraints on the weights, CNNs are shown to outperform other techniques in handwritten character recognition tasks.
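As a rough illustration of this architecture family, here is a LeNet-5-style network sketched in PyTorch. The layer sizes follow the commonly cited LeNet-5 layout, but activations and other details are simplified and should not be read as the paper's exact model.

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    """Sketch of a LeNet-5-like CNN: local connections and weight sharing via
    convolutions, subsampling via pooling, then fully connected layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetStyle()
logits = model(torch.randn(4, 1, 32, 32))  # batch of 4 grey-scale 32x32 images
```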
Comparative Study: Isolated Handwritten Digit Recognition
The paper presents a comparison of various methods applied to handwritten digit recognition. It demonstrates that CNNs trained with gradient-based learning methods perform better than all other techniques tested on the same dataset. This strengthens the case for relying more on machine learning and less on hand-crafted feature extraction for building recognition systems.
Global Training and Graph Transformer Networks
One of the main challenges in handwriting recognition is segmenting characters from their neighbors within a word or sentence. The authors propose techniques to train the recognizer using whole strings of characters instead of individual characters, minimizing an overall loss function. They introduce Graph Transformer Networks (GTNs), which represent alternative hypotheses and their scores using directed acyclic graphs, making it possible to train multimodule systems using gradient-based learning methods.
Heuristic Oversegmentation: Training Recognizers on Whole Strings
The authors discuss various methods of training a recognizer at the word level without requiring manual segmentation and labeling. This includes the Heuristic Oversegmentation (HOS) approach, which generates a large number of potential cuts between characters and selects the best combination based on scores from the recognizer. Training the system at the word level helps in dealing with the challenges of consistently labeling segmented characters and allows for better overall performance.
Space-Displacement Neural Networks: Eliminating the Need for Segmentation
The paper introduces the concept of space-displacement neural networks (SDNNs), an approach that eliminates the need for segmentation heuristics by scanning a recognizer at all possible locations within the input. This allows the recognizer to directly consider the context and the spatial relationships between characters, further improving the accuracy of recognition.
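One way to see the SDNN idea is that a fully convolutional recognizer, when given a wider input image, automatically produces one score vector per horizontal position, which is the effect of scanning the recognizer across the word. The sketch below (our own toy network in PyTorch, not the paper's SDNN) shows that effect.

```python
import torch
import torch.nn as nn

# Toy recognizer: on a 32x32 character it produces a single 10-way score vector.
recognizer = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
    nn.Conv2d(8, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),
    nn.Conv2d(16, 10, kernel_size=5),
)

single_char = torch.randn(1, 1, 32, 32)
word_image = torch.randn(1, 1, 32, 128)   # same height, four times wider

print(recognizer(single_char).shape)  # torch.Size([1, 10, 1, 1]): one position
print(recognizer(word_image).shape)   # torch.Size([1, 10, 1, 25]): scores at many positions
```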
Unified Design Paradigm: Graph Transformer Networks and Applications
The authors present GTNs as a unified and well-principled design paradigm for building recognition systems. They discuss the connections between GTNs and other popular techniques in the field, such as Hidden Markov Models (HMMs). The paper also describes practical applications of GTNs, including a system for recognizing handwriting entered on a pen computer, demonstrating the advantages of training a recognizer at the word level and the flexibility of GTNs.
Commercial Applications: Reading Handwritten and Machine-Printed Bank Checks
Lastly, the authors describe a complete GTN-based system for reading handwritten and machine-printed bank checks. This system, which is deployed commercially and reads millions of checks monthly, includes the convolutional neural network called LeNet-5 at its core and achieves record accuracy on business and personal checks thanks to global training techniques and CNN-based character recognition.
Conclusion
Gradient-Based Learning Applied to Document Recognition highlights the advantages of using gradient-based learning techniques and specialized neural network architectures, such as CNNs, in tackling complex pattern recognition tasks. The introduction of Graph Transformer Networks as a unified design paradigm for multimodule systems demonstrates the potential for further improvements in pattern recognition applications, leading to practical solutions for real-world problems, such as the reading of handwritten and machine-printed bank checks.
A Fast Learning Algorithm for Deep Belief Nets
Authors: Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh
Source & References: https://pubmed.ncbi.nlm.nih.gov/16764513/
Published: July 2006
Introduction
This groundbreaking research paper, "A Fast Learning Algorithm for Deep Belief Nets," by Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh, explores a new approach to training deep belief networks using "complementary priors" that eliminates the issue of "explaining away" typically found in densely-connected belief nets. By addressing this hurdle, the authors propose an efficient greedy algorithm that can learn deep, directed belief networks one layer at a time. The resulting generative model is highly effective at modeling handwritten digit images, outperforming discriminative learning algorithms on the same task.
Complementary Priors
The key innovation is the use of complementary priors to cancel out the "explaining away" effect in deep belief nets. The authors provide an example of a directed belief net with complementary priors and demonstrate that it is possible to construct a prior that cancels out the correlations in the likelihood term, so that the posterior distribution can be modeled accurately and unbiased samples can be drawn from it. The paper then demonstrates a surprising equivalence between a specific type of deep belief network and Restricted Boltzmann Machines (RBMs), which simplifies the learning process.
Equivalence between RBMs and Deep Belief Nets
The authors explore the similarities between RBMs and infinite directed networks with tied weights. They describe how the process of generating data from an infinite directed belief network with tied weights can be achieved through alternating Gibbs sampling, as it is for RBMs. The maximum likelihood learning rule in RBMs is equivalent to the learning rule for these deep nets. The authors also discuss contrastive divergence learning, which minimizes the difference between two Kullback-Leibler divergences and allows for a more efficient learning algorithm.
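For readers who want to see the mechanics, below is a minimal NumPy sketch of the contrastive divergence (CD-1) update for an RBM with binary units. The sizes, learning rate, and random stand-in data are our own illustrative choices, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 784, 256, 0.05
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)                 # visible biases
b_h = np.zeros(n_hidden)                  # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

data = (rng.random((64, n_visible)) < 0.1).astype(float)   # stand-in for binarized digits

for epoch in range(5):
    # Up pass: hidden probabilities and samples given the data.
    p_h0 = sigmoid(data @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # One step of alternating Gibbs sampling: reconstruct visibles, then hiddens.
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # CD-1 update: data correlations minus one-step reconstruction correlations.
    W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / len(data)
    b_v += lr * (data - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)
```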
Greedy Learning Algorithm
Using the insights gained from the equivalence between RBMs and deep belief networks with tied weights, the authors propose a greedy learning algorithm for constructing multi-layer networks. The algorithm works by progressively "untying" the weights in each layer from the weights in higher layers. Importantly, the learning rule is local: the change in a synapse's strength depends only on the states of the pre-synaptic and post-synaptic neurons. This new approach reduces computational complexity and speeds up the training process.
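A compact sketch of the greedy layer-wise recipe is shown below: train an RBM on the data, then treat its hidden activations as the "data" for the next RBM, one layer at a time. The helper train_rbm is a compressed version of the CD-1 update above, written only for this illustration; layer sizes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.05, epochs=5):
    """Tiny CD-1 trainer; returns the weight matrix and hidden biases."""
    n_visible = data.shape[1]
    W = 0.01 * rng.normal(size=(n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        p_h0 = sigmoid(data @ W + b_h)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b_v)
        p_h1 = sigmoid(p_v1 @ W + b_h)
        W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        b_v += lr * (data - p_v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_h

layer_sizes = [512, 512, 256]                               # hidden sizes, chosen arbitrarily
activations = (rng.random((64, 784)) < 0.1).astype(float)   # stand-in for binarized images
stack = []
for n_hidden in layer_sizes:
    W, b_h = train_rbm(activations, n_hidden)
    stack.append((W, b_h))
    activations = sigmoid(activations @ W + b_h)            # features for the next layer
```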
Fine-Tuning with the "Up-Down" Algorithm
Once the base model has been trained using the greedy learning algorithm, the authors demonstrate how to fine-tune the weights using an "up-down" algorithm. This method is a contrastive version of the wake-sleep algorithm that does not suffer from the "mode-averaging" issues that can cause poor performance. The results show that a network with three hidden layers can accurately model the joint distribution of handwritten digit images and their respective labels.
Outperforming Discriminative Learning
The authors report that their proposed generative model outperforms discriminative learning methods on the MNIST database of handwritten digits when no prior knowledge of geometry is provided, and no special preprocessing is performed. The achieved error rate of 1.25% is superior to that of the best back-propagation networks and support vector machines. These results indicate the potential for the proposed learning algorithm in various practical applications.
Exploring the Generative Model
One significant advantage of this generative model is the ease with which the deep hidden layers' distributed representations can be interpreted. Since the network has a full-scale generative model, it is possible to generate images from high-level representations, which help in understanding the network's internal workings. This offers valuable insight into the thought process of the network when it is functioning independently, without visual input.
Conclusion
"A Fast Learning Algorithm for Deep Belief Nets" is an important step in the development of machine learning and deep learning algorithms. By introducing complementary priors to cancel out the explaining away effect and demonstrating the equivalence between RBMs and deep belief networks, Hinton, Osindero, and Teh were able to invent an innovative and efficient learning algorithm capable of outperforming other discriminative learning techniques in processing handwritten digit images.
Paving the way for improved learning algorithms that can handle more complex data, this research has had a lasting impact on the field of artificial intelligence. The presented approach represents a valuable advancement in deep learning, providing a better understanding of deep generative models, and offering the potential for even more sophisticated applications in future research.
Deep Residual Learning for Image Recognition
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Source & References: https://arxiv.org/abs/1512.03385
Published: 10 December 2015
Introduction
Deep convolutional neural networks have led to a series of breakthroughs in image classification. However, simply stacking more layers does not keep improving accuracy: beyond the vanishing/exploding gradient issue (largely handled by normalized initialization and intermediate normalization layers), deeper plain networks suffer from a degradation problem in which accuracy saturates and then declines. In this work, authors Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun address these problems by introducing a deep residual learning framework capable of training networks much deeper than those used previously.
The Concept of Residual Learning
The essence of residual learning lies in reformulating the layers of a neural network to learn residual functions with reference to the layer inputs instead of learning unreferenced functions. The authors hypothesize that residual mappings are easier to optimize than unreferenced mappings, enabling considerably increased depth and accuracy.
In the residual learning framework, instead of expecting each stack of layers to fit an underlying mapping directly, the layers are explicitly made to fit a residual mapping. If the optimal function is closer to an identity mapping than to a zero mapping, it is easier for the solver to find small perturbations with reference to an identity mapping than to learn the function from scratch.
Identity Mapping by Shortcuts
Residual learning is achieved through the use of shortcut connections, which skip one or more layers in a neural network. These connections perform identity mapping and their outputs are added to the outputs of the stacked layers. Identity shortcut connections do not add extra parameters or computational complexity to the network and offer a reasonable preconditioning while addressing the degradation problem.
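Here is a minimal PyTorch sketch of a residual block in the spirit of the paper's basic block (two 3x3 convolutions with batch normalization, plus an identity shortcut). Downsampling blocks and the bottleneck variant used in the deepest models are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """y = F(x) + x, where F (two 3x3 conv layers) is the residual function."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        residual = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(residual + x)   # identity shortcut: add the input back

x = torch.randn(2, 64, 56, 56)
y = BasicResidualBlock(64)(x)         # same shape as x
```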
Network Architectures for ImageNet
Two main models are considered for the ImageNet dataset: plain networks and residual networks. A plain network is a straightforward network with a standard architecture, whereas the residual network incorporates the shortcut connections discussed earlier. The plain network serves as a baseline to compare the performance of the residual counterpart.
Results on ImageNet Classification
On the ImageNet dataset, the authors compared 18-layer and 34-layer plain networks, finding that the deeper plain network had higher validation error rates. The residual networks, however, did not have the degradation issue observed in the plain networks.
Using the residual framework, the authors successfully trained models with over 100 layers (up to 152 layers). These models not only had lower complexity than existing state-of-the-art networks such as VGG nets, but also outperformed them, with an ensemble achieving a 3.57% top-5 error rate on the ImageNet test set.
The excellent performance of these extremely deep residual networks won first place in the ILSVRC 2015 classification task, along with outstanding generalization to other recognition tasks: first places in ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation at the ILSVRC & COCO 2015 competitions.
Additional Experiments and Analysis
The authors further conducted experiments on the CIFAR-10 dataset with networks of over 100 and over 1,000 layers, observing similar optimization difficulties for plain networks and the same positive effects from the residual learning framework. They concluded that residual learning is a generic principle applicable to other vision and non-vision problems.
Moreover, the learned residual functions had small responses compared to the original functions, which indicates that identity mappings provide reasonable preconditioning.
Conclusion
The authors successfully introduced a deep residual learning framework to tackle the vanishing/exploding gradients and degradation problems in training deep neural networks. This framework makes it possible to optimize extremely deep neural networks with up to 152 layers, outperforming state-of-the-art models like VGG nets while having lower complexity.
The results of this work provide evidence that residual learning is generic and can be applied to other vision and non-vision tasks, leading to outstanding performance in ImageNet and COCO competitions. This work paves the way for the exploration and development of even deeper and more efficient neural networks for various applications.
Overall, the deep residual learning framework represents an exciting and promising advancement in the field of deep learning, shedding light on the optimization of deep neural networks and expanding the capabilities of image recognition tasks.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Source & References: https://arxiv.org/abs/1810.04805
Published: 18 October 2018
Introduction
BERT (Bidirectional Encoder Representations from Transformers) is a revolutionary language representation model that has significantly impacted the field of natural language processing (NLP). In this summary, we'll dive into the key ideas behind BERT, how it works, and its impressive results on various NLP tasks. This groundbreaking method, introduced by researchers at Google AI Language, was designed to pre-train deep bidirectional representations from unlabeled text by conditioning on both the left and right context in all layers. The result is a powerful model that outperforms existing techniques in various NLP applications.
Why BERT matters
Traditional language representation models, such as OpenAI GPT, were unidirectional, meaning they only considered context from one direction (left-to-right or right-to-left); ELMo combined two such unidirectional models, but only through a shallow concatenation. This limitation hindered the potential of pre-trained representations, especially for token-level tasks like question answering, which require context from both directions.
BERT addresses this issue by employing a "masked language model" (MLM) pre-training objective. This approach allows it to train a deep bidirectional Transformer that captures context from both left and right, leading to a more expressive and context-aware representation.
One of the key contributions of BERT is its ease of use. Unlike previous methods that required extensive architecture modifications for specific tasks, BERT can be fine-tuned with just one additional output layer, allowing it to create state-of-the-art models for a wide range of applications without substantial alterations.
BERT's Architecture
At its core, BERT utilizes the multi-layer bidirectional Transformer encoder, based on the work of Vaswani et al. The primary building blocks of BERT are its layers (L), hidden size (H), and self-attention heads (A). Researchers primarily experimented with two model sizes: BERT_BASE (L=12, H=768, A=12) and BERT_LARGE (L=24, H=1024, A=16).
An essential feature of BERT is its unified architecture for different tasks, meaning it uses the same structure for both pre-training and fine-tuning.
Input/Output representations
BERT's input representation can accommodate single sentences or pairs of sentences in a straightforward manner. The model uses the WordPiece tokenization technique with a 30,000 token vocabulary and distinguishes sentences with special tokens ([CLS], [SEP]) and learned segment embeddings. Position embeddings are also used to maintain the token order.
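The sketch below illustrates how such an input representation can be assembled: token, segment, and position embeddings are summed element-wise. The sizes and token ids here are illustrative placeholders, not BERT's actual vocabulary.

```python
import torch
import torch.nn as nn

vocab_size, hidden, max_len = 30000, 768, 512
tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(2, hidden)          # segment A vs. segment B
pos_emb = nn.Embedding(max_len, hidden)

# "[CLS] sentence A tokens [SEP] sentence B tokens [SEP]" with made-up ids
token_ids = torch.tensor([[101, 7592, 2088, 102, 2129, 2024, 2017, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

embeddings = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(positions)
print(embeddings.shape)   # (1, 8, 768): one vector per input token
```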
During pre-training, BERT uses two unsupervised tasks: the masked language model (MLM) and the next sentence prediction (NSP).
Task #1: Masked Language Model (MLM)
MLM randomly masks a percentage of input tokens, requiring the model to predict the original tokens based on their surrounding context. This approach allows BERT to learn bidirectional context effectively, unlike traditional left-to-right or right-to-left language models.
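In the paper, 15% of the WordPiece tokens are selected for prediction, and of those, 80% are replaced by [MASK], 10% by a random token, and 10% left unchanged. A simplified masking function along those lines (our own sketch, with made-up ids and no special-token handling) might look like this:

```python
import torch

MASK_ID, VOCAB_SIZE = 103, 30000   # placeholder ids for the sketch

def mask_tokens(token_ids, mlm_prob=0.15):
    labels = token_ids.clone()
    selected = torch.rand(token_ids.shape) < mlm_prob        # tokens to predict
    labels[~selected] = -100                                  # ignore the rest in the loss

    corrupted = token_ids.clone()
    split = torch.rand(token_ids.shape)
    corrupted[selected & (split < 0.8)] = MASK_ID             # 80%: [MASK]
    random_ids = torch.randint(0, VOCAB_SIZE, token_ids.shape)
    use_random = selected & (split >= 0.8) & (split < 0.9)    # 10%: random token
    corrupted[use_random] = random_ids[use_random]
    # remaining 10% of the selected tokens stay unchanged
    return corrupted, labels

ids = torch.randint(1000, 2000, (2, 16))
corrupted, labels = mask_tokens(ids)
```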
Task #2: Next Sentence Prediction (NSP)
Many downstream tasks, like question answering and natural language inference, require understanding the relationship between two sentences. NSP is a binary task: given a pair of sentences, predict whether the second is the actual next sentence or a random sentence from the corpus. This simple yet effective task contributes significantly to BERT's performance on QA and NLI tasks.
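Constructing NSP training examples is straightforward: half the time sentence B is the true successor (IsNext), half the time a random sentence from the corpus (NotNext). A toy sketch of that pairing (our own simplification):

```python
import random

def make_nsp_example(doc_sentences, all_sentences, i):
    """Pair sentence i with its true successor or a random distractor."""
    sent_a = doc_sentences[i]
    if random.random() < 0.5 and i + 1 < len(doc_sentences):
        sent_b, label = doc_sentences[i + 1], 1               # IsNext
    else:
        sent_b, label = random.choice(all_sentences), 0       # NotNext
    return sent_a, sent_b, label

doc = ["The cat sat.", "It purred.", "Then it slept."]
corpus = doc + ["Stocks rallied today.", "Rain is expected."]
print(make_nsp_example(doc, corpus, 0))
```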
Pre-training & Fine-tuning
Before being fine-tuned for specific tasks, BERT is pre-trained on large amounts of unlabeled text, such as the BooksCorpus and English Wikipedia datasets. Fine-tuning essentially involves initializing the BERT model with pre-trained parameters and then fine-tuning all parameters end-to-end using labeled data from downstream tasks. BERT's self-attention mechanism allows it to easily adapt to various tasks, be it single text or text pair applications, making it extremely versatile.
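Conceptually, fine-tuning for classification amounts to placing one linear layer on top of the [CLS] representation and training everything end-to-end. The sketch below assumes a pre-trained encoder object with a particular call signature; both the encoder stand-in and that signature are placeholders for this illustration, not a real library API.

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Pre-trained encoder plus the single extra output layer used in fine-tuning."""
    def __init__(self, encoder, hidden=768, num_labels=2):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden, num_labels)        # the one extra layer

    def forward(self, token_ids, segment_ids):
        hidden_states = self.encoder(token_ids, segment_ids)   # (batch, seq, hidden)
        cls_vector = hidden_states[:, 0]                        # [CLS] is the first token
        return self.classifier(cls_vector)

class DummyEncoder(nn.Module):
    """Stand-in for a pre-trained BERT encoder (placeholder, not a real API)."""
    def forward(self, token_ids, segment_ids):
        return torch.randn(token_ids.size(0), token_ids.size(1), 768)

model = ClassifierHead(DummyEncoder())
logits = model(torch.randint(0, 30000, (2, 16)), torch.zeros(2, 16, dtype=torch.long))
# During fine-tuning, all parameters (encoder included) receive gradients, e.g.:
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```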
Impressive Results
BERT achieved state-of-the-art results on eleven NLP tasks, including the GLUE, MultiNLI, and SQuAD benchmarks, improving performance on question answering, sentiment analysis, and named entity recognition, among others. These results underscore the value of bidirectional pre-training of language representations and the simplicity of the fine-tuning approach.
In conclusion, BERT is a conceptually simple, yet highly efficient and versatile model designed to revolutionize a wide range of NLP applications. Its ability to leverage deep bidirectional context while mitigating the limitations of unidirectional language models has made it a powerful tool for modeling various language understanding tasks.
Mastering the game of Go with deep neural networks and tree search
Authors: David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis
Source & References: https://www.nature.com/articles/nature16961
Published: 27 January 2016
Introduction
The game of Go, known for its complexity and enormous search space, has long posed a significant challenge for artificial intelligence (AI). This paper introduces AlphaGo, an AI program developed by Google DeepMind that combines deep neural networks with tree search to excel at the ancient game. By training on expert human moves and through self-play, AlphaGo builds policy and value networks that guide a powerful search algorithm. This groundbreaking achievement marks the first time a computer program has defeated a professional human Go player in the full-sized game, a milestone that was once considered at least a decade away.
The Importance of Go
While AI programs have already outperformed humans in games like chess, Go poses a unique challenge due to its vast search space and the difficulty of evaluating board positions and moves. The game requires players to make strategic moves that capture territory and surround the opponent's stones. Compared with chess, where brute-force search carries further, Go relies more on intuition and pattern recognition, which long made it difficult for AI algorithms to master.
Training AlphaGo
AlphaGo's success is built on deep neural networks trained in a multi-stage pipeline combining several machine learning techniques. The authors start by training a supervised learning (SL) policy network to predict expert human moves using millions of positions from the KGS Go Server. This network is then improved through reinforcement learning (RL), which optimizes the policy to win games against previous versions of the network. Finally, a value network is trained to predict the outcome of games played by the RL policy network.
With this foundation in place, AlphaGo uses Monte Carlo Tree Search (MCTS) to efficiently search the vast space of possible games. MCTS combines the expertise of the policy network, which samples promising actions, and the insights of the value network, which evaluates the likely outcome of the game.
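As a rough illustration of how the two networks meet inside the search, here is a simplified version of the PUCT-style selection rule described in the paper: each edge stores a visit count, an accumulated value, and a policy prior, and the search picks the action with the highest value-plus-exploration score. The data structures and the constant below are our own illustrative choices, not the paper's implementation.

```python
import math

def select_action(edges, c_puct=5.0):
    """edges: {action: {'N': visit count, 'W': total value, 'P': prior probability}}"""
    total_visits = sum(e['N'] for e in edges.values())
    best_action, best_score = None, -float('inf')
    for action, e in edges.items():
        q = e['W'] / e['N'] if e['N'] > 0 else 0.0                    # mean value so far
        u = c_puct * e['P'] * math.sqrt(total_visits) / (1 + e['N'])  # exploration bonus
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action

edges = {
    'A': {'N': 10, 'W': 6.0, 'P': 0.5},
    'B': {'N': 2,  'W': 1.5, 'P': 0.3},
    'C': {'N': 0,  'W': 0.0, 'P': 0.2},
}
print(select_action(edges))   # rarely visited moves with decent priors get explored
```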
Evaluating AlphaGo's Performance
To assess AlphaGo's performance, the authors ran an internal tournament featuring variants of AlphaGo and other top Go programs, such as Crazy Stone, Zen, Pachi, and Fuego. With approximately five seconds of computation time per move, AlphaGo achieved a 99.8% win rate against the other Go programs. Moreover, AlphaGo defeated the reigning European Go champion, Fan Hui, by a score of 5-0.
Interestingly, the authors found that the SL policy network, which was designed to mimic human play, performed better in AlphaGo than the stronger RL policy network. This is because humans select a diverse range of moves, whereas the RL policy network optimizes for the single best move. On the other hand, the value function derived from the RL policy network performed better than its SL-policy counterpart.
Scaling Up AlphaGo
A significant challenge in using deep neural networks for MCTS is the computation time required to evaluate policy and value networks. To address this, AlphaGo employs an asynchronous multi-threaded search, dividing the task of searching the game tree among CPUs and GPUs. By distributing computation across 48 CPUs and 8 GPUs, the final version of AlphaGo can quickly and efficiently explore the game space.
The authors also built a distributed version of AlphaGo that uses multiple machines, with 176 GPUs and 1,202 CPUs. This distributed approach allowed AlphaGo to demonstrate even stronger performance when given more computational resources.
Implications and Future Work
AlphaGo's success in mastering the game of Go serves as an impressive demonstration of the power and potential of deep learning. By combining these advanced neural networks with efficient search algorithms, AI systems like AlphaGo can excel in domains that were once considered out of reach due to their complexity.
While AlphaGo's achievement represents a significant milestone in AI research, there is still much work to be done. The authors suggest future work could focus on reducing the reliance on human expert games in training and instead exploring the potential of unsupervised learning, as well as applying similar methods to other complex domains that require a combination of pattern recognition, planning, and decision-making.
In conclusion, AlphaGo has achieved a remarkable feat in demonstrating the power of deep learning to conquer the complex game of Go. Its success serves as a strong foundation for future research and applications, as AI continues to advance and address increasingly sophisticated problems.