Probing the Hallucinations of AI: Unveiling the Internal Representations of Language Models
AI Pulse
Welcome to the second issue of AI Pulse. Our goal is simple: each issue will focus on breaking down one important AI or Machine Learning paper. We aim to provide clear, in-depth analysis so that our readers, whether they're professionals, academics, or enthusiasts, can easily understand key developments in the field.
Reading Time: ~4 Minutes
Do Androids Know They're Only Dreaming of Electric Sheep?
Authors: Sky CH-Wang, Benjamin Van Durme, Jason Eisner, Chris Kedzie
Source and references: https://arxiv.org/abs/2312.17249
Catching Hallucinations in AI Models
Language models are designed to generate text for tasks such as summarization, dialogue, and data-to-text generation. Sometimes these models incorporate information that is not present in the input, a phenomenon known as hallucination. Hallucinations can be classified as intrinsic, where the generated output directly contradicts the input, or extrinsic, where the output is neither supported nor contradicted by the input.
The central question of the paper is: can language models detect when they're hallucinating? In other words, do a model's internal states carry a signal about whether the generated text is grounded in the input or is merely plausible content drawn from the model's own knowledge?
The research aims to develop "probes" that can detect hallucinations in the hidden states of transformer language models during text generation.
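To make the setup concrete, here is a minimal sketch, not the paper's code, of how per-token decoder hidden states can be collected during generation with the Hugging Face transformers library; the choice of gpt2 and the prompt are placeholders, as the paper probes different models.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "gpt2" is only a small stand-in model for illustration.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Summarize: The museum opened in 1984 and holds 2,000 artifacts.\nSummary:"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=20,
        do_sample=False,
        return_dict_in_generate=True,
        output_hidden_states=True,
        pad_token_id=tokenizer.eos_token_id,
    )

# out.hidden_states has one entry per generated token; each entry is a tuple
# over layers of tensors shaped (batch, seq_len_at_that_step, hidden_size).
# Take the final layer's state at the last position of each step, i.e. the
# representation the model used when predicting that token.
per_token_states = torch.stack(
    [step[-1][:, -1, :] for step in out.hidden_states], dim=1
)
print(per_token_states.shape)  # (batch, num_generated_tokens, hidden_size)
```

States like these, paired with hallucination labels, are the raw material a probe is trained on.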
A High-Quality Dataset for Hallucination Detection
The authors created a dataset of over 15,000 examples with hallucination annotations. They focused on three tasks: abstractive summarization, knowledge-grounded dialogue generation, and data-to-text generation. For each task, they generated outputs from a large language model conditioned on the inputs, and then had human annotators mark hallucinations in those outputs.
The resulting examples were further split into organic hallucinations, which were sampled from the model's outputs, and synthetic hallucinations, which were created by manually editing reference inputs or outputs to introduce discrepancies.
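As an illustration of what such an annotated example might look like, the sketch below uses a hypothetical record layout; the field names and the toy example are assumptions for exposition, not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HallucinationExample:
    """Hypothetical record layout for a token-annotated example (not the paper's schema)."""
    task: str                        # "summarization" | "dialogue" | "data_to_text"
    source: str                      # grounding input the output should be faithful to
    output_tokens: List[str]         # generated (organic) or edited (synthetic) output
    hallucination_labels: List[int]  # 1 if the token lies in an annotated hallucinated span
    hallucination_type: str          # "intrinsic" | "extrinsic" | "none"
    origin: str                      # "organic" (sampled from the model) or "synthetic" (edited)

example = HallucinationExample(
    task="summarization",
    source="The museum opened in 1984 and holds 2,000 artifacts.",
    output_tokens=["The", "museum", "opened", "in", "1994", "."],
    hallucination_labels=[0, 0, 0, 0, 1, 0],
    hallucination_type="intrinsic",  # the date contradicts the source
    origin="synthetic",              # produced by editing the reference
)
```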
Predicting Hallucination with Probes
The authors propose three probe architectures that detect hallucinations during decoding by reading the hidden states of a transformer. The probes are trained to determine whether the model has just begun to hallucinate or will eventually generate ungrounded text. By varying factors such as annotation type (synthetic vs. organic), hallucination type (extrinsic vs. intrinsic), model size, and which part of the encoding is probed, the researchers map out when and where hallucination signals appear in the model's internal states.
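The paper's three architectures are not reproduced here, but the following sketch shows the simplest plausible variant of the idea: a linear classifier trained on each token's hidden state from a frozen language model to predict a binary hallucination label. Shapes and names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LinearHallucinationProbe(nn.Module):
    """Token-level probe: frozen LM hidden state -> P(token is hallucinated).

    A minimal sketch of the general idea; the paper's probe architectures differ.
    """
    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) taken from a frozen LM
        return self.classifier(hidden_states).squeeze(-1)  # logits, (batch, seq_len)

# Toy training step on random tensors standing in for real hidden states and labels.
hidden_size = 768
probe = LinearHallucinationProbe(hidden_size)
states = torch.randn(4, 20, hidden_size)       # per-token states from the frozen LM
labels = torch.randint(0, 2, (4, 20)).float()  # 1 = token annotated as hallucinated
loss = nn.functional.binary_cross_entropy_with_logits(probe(states), labels)
loss.backward()  # only the probe's parameters receive gradients; the LM stays frozen
```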
The researchers found that synthetic examples were less useful for training hallucination detectors than organic ones, since manually edited text does not come from the distribution the model actually produces at test time. They also found that the hallucination information carried in hidden states depends on the task and the data distribution.
Probes for Improved Hallucination Detection
When model states are available, the three proposed probes outperform several contemporary baselines at hallucination detection, making probing a practical and efficient alternative for evaluating language model hallucinations.
The study also revealed that extrinsic hallucinations tend to be more salient in a transformer's internal representations, and are therefore easier to detect. Which layers and which types of hidden states carried the strongest hallucination signal varied with the task.
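One simple way to examine this kind of layer dependence, sketched below under assumed shapes and with random stand-in data, is to fit an identical probe on each layer's states and compare a detection metric such as AUROC across layers; on real hidden states the scores would differ by layer, whereas the random data here stays near 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
num_layers, hidden_size, n_tokens = 12, 768, 2000

# Random stand-ins for per-layer hidden states and token-level hallucination labels.
layer_states = rng.normal(size=(num_layers, n_tokens, hidden_size))
labels = rng.integers(0, 2, size=n_tokens)

split = n_tokens // 2  # first half for fitting the probe, second half for evaluation
for layer in range(num_layers):
    X_train, X_test = layer_states[layer][:split], layer_states[layer][split:]
    y_train, y_test = labels[:split], labels[split:]
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"layer {layer:2d}: hallucination-detection AUROC = {auc:.3f}")
```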
Implications for AI and Language Models
Understanding when a language model is hallucinating is essential for improving the groundedness and reliability of its generated text. By harnessing this knowledge, developers can create better-performing AI models that produce human-like text while still being faithful to the input data.
The research highlights the importance of investigating internal representations in transformer models to get insights into their behavior. Detecting hallucinations during decoding can help improve the performance of AI models, making them more reliable for tasks such as summarization, dialogue, and data-to-text generation.
In conclusion, this study demonstrates the feasibility of using probing techniques to detect and analyze hallucination behavior in transformer language models. By combining these insights with improvements in language model design and training, we can create AI models that produce more faithful and grounded generations, leading to better performance across various tasks.