A Research Review of Why Hallucinations Exist (Statistical Bounds, Ecosystem Incentives)
- Nikita Silaech
- Sep 8
- 2 min read

By Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, Edwin Zhang | OpenAI
Published: September 4, 2025
Overview
This paper from OpenAI investigates hallucinations in large language models (LLMs) by formalizing the problem mathematically and providing empirical evidence that it persists in practice. The authors argue that hallucinations stem from a fundamental mismatch between the next-token prediction objective and the requirement of factual correctness. Through proofs and experiments, they show that hallucinations persist even with more training data or larger models. They also evaluate hallucination frequency, confidence calibration, and the impact of different decoding methods across benchmark tasks.
Key Focus Areas
1. Formalizing Hallucinations
Hallucinations are defined as outputs with low factual accuracy despite high model confidence.
The authors provide statistical bounds showing that hallucinations cannot be eliminated, only minimized, under the current training paradigm.
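Schematically, and with the notation simplified for this review (the exact theorem statements, constants, and side conditions are in the paper), the bounds tie the error rate of generation to the error rate of a related binary classification task of judging whether a candidate output is valid:

```latex
% Simplified schematic of the paper's bounds, not the exact theorem statements.
\[
  \underbrace{\operatorname{err}_{\text{generation}}}_{\text{rate of erroneous outputs}}
  \;\gtrsim\;
  2 \cdot \underbrace{\operatorname{err}_{\text{classification}}}_{\text{``is this candidate output valid?''}}
  \;-\; \text{(calibration terms)}
\]
% Consequence for facts with no learnable pattern (e.g., one-off biographical details):
\[
  \text{hallucination rate} \;\gtrsim\; \text{singleton rate},
\]
% where the singleton rate is the fraction of such facts appearing exactly once in training.
```

The intuition is that a generator which never hallucinated could be repurposed as an accurate validity classifier, so whenever that classification task is statistically hard, some generation errors are forced.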
2. Empirical Evidence
Across multiple datasets (e.g., QA benchmarks), hallucination rates remain non-trivial (10–20%) even for state-of-the-art models.
Larger models (GPT-4 compared to smaller LLMs) show reduced but non-zero hallucination frequency, demonstrating that scale mitigates but does not solve the problem.
Confidence calibration experiments reveal that LLMs often assign probabilities >0.9 to incorrect statements, highlighting a mismatch between confidence and truthfulness.
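The calibration gap described above is typically quantified with expected calibration error (ECE), the bin-weighted gap between average confidence and average accuracy. Below is a minimal, self-contained sketch of that computation on hypothetical (confidence, correctness) pairs; it is illustrative only and is not the paper's evaluation code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: average |accuracy - confidence| weighted by bin occupancy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of samples
    return float(ece)

# Hypothetical (confidence, correct?) pairs: note the confident-but-wrong cases,
# mirroring the >0.9 confidence on incorrect statements reported above.
conf = [0.95, 0.92, 0.97, 0.60, 0.85, 0.99, 0.55, 0.91]
hit  = [1,    0,    1,    1,    0,    0,    1,    0]
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```

A perfectly calibrated model would have an ECE near zero; confident errors like those reported in the paper push it up.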
3. Causes of Hallucinations
Training Objective: Optimizing for likelihood, not truth.
Data Limitations: Gaps and inconsistencies in training corpora.
Inference Dynamics: Decoding strategies (temperature, nucleus sampling) increase variability, trading factuality for fluency.
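To make the inference-dynamics point concrete, the sketch below shows how temperature and nucleus (top-p) sampling reshape a toy next-token distribution; the logits are invented for illustration and the code is generic, not taken from the paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at a given sampling temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                       # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability >= top_p."""
    order = np.argsort(probs)[::-1]    # token indices sorted by probability, descending
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()   # renormalize over the kept nucleus

# Toy logits for five candidate next tokens (hypothetical values).
logits = [4.0, 3.2, 1.0, 0.5, -1.0]
for T in (0.5, 1.0, 1.5):
    p = nucleus_filter(softmax(logits, temperature=T), top_p=0.9)
    print(f"T={T}: {np.round(p, 3)}")
```

Raising the temperature flattens the distribution and lets more low-ranked tokens into the nucleus, which increases output variability, the fluency-for-factuality trade-off noted above.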
4. Characterization
The paper distinguishes:
Intrinsic hallucinations – outputs that contradict the user’s prompt or the provided source content.
Extrinsic hallucinations – outputs that conflict with, or cannot be verified against, facts outside the prompt, such as the training data or external world knowledge.
5. Mitigation Approaches
Training-time alignment: Adjusting objectives for factual accuracy (e.g., reinforcement learning with human feedback).
Inference-time calibration: Constraining entropy, adjusting temperature, or using logit thresholds (a minimal sketch of a threshold-based abstention rule follows this list).
External grounding: Combining LLMs with retrieval-augmented generation (RAG) systems to anchor outputs in factual sources.
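One simple way to operationalize the inference-time calibration item above is a confidence-threshold abstention rule: answer only when the top candidate clears a probability threshold, and otherwise decline. The candidates, probabilities, and threshold below are hypothetical; this is a sketch of the idea, not an interface from the paper or from any particular LLM API.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    prob: float  # model-reported probability that this answer is correct

def answer_or_abstain(candidates, threshold=0.75):
    """Return the top candidate if its probability clears the threshold; otherwise abstain."""
    best = max(candidates, key=lambda c: c.prob)
    return best.text if best.prob >= threshold else "I don't know."

# Hypothetical candidate answers to a factual question.
cands = [Candidate("Paris", 0.62), Candidate("Lyon", 0.23), Candidate("Marseille", 0.15)]
print(answer_or_abstain(cands))        # 0.62 < 0.75 -> "I don't know."
print(answer_or_abstain(cands, 0.5))   # 0.62 >= 0.5 -> "Paris"
```

The threshold trades coverage for reliability: a higher threshold abstains more often but hallucinates less, which only helps if the model's probabilities are reasonably calibrated in the first place.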
Strengths of the Paper
Provides both theoretical proofs and empirical measurements, making the case for hallucinations as unavoidable under current objectives.
Introduces quantitative benchmarks (hallucination frequency, calibration error) for systematically measuring reliability.
Demonstrates that hallucinations persist even at large scale, framing them as a long-term challenge rather than a transient artifact.
Future Directions
Benchmark Development: Expanding datasets dedicated to hallucination measurement across diverse domains.
Improved Calibration: Designing probability estimation methods that better reflect factual accuracy.
Hybrid Architectures: Large-scale evaluation of retrieval-augmented systems to measure factual grounding effectiveness.
Human-Centric Metrics: Studying hallucination severity in applied domains (e.g., law, healthcare) where consequences differ significantly.
Context in Ongoing AI Research
The study aligns with concurrent research efforts, including:
TruthfulQA (Lin et al., 2021): Similar emphasis on measuring factuality separately from fluency.
Mixture-of-experts & RAG systems: Retrieval-augmented generation in particular aligns with the proposed external grounding strategies.
Policy frameworks (EU AI Act, 2025): Increasingly demand transparency and safeguards against misleading outputs.
The OpenAI study “Why Language Models Hallucinate” provides one of the most rigorous treatments of hallucinations in LLMs to date. Through a combination of theoretical bounds and empirical results, it shows that hallucinations are a structural feature of next-token prediction and cannot be fully removed by scale alone. By formalizing definitions, quantifying prevalence, and mapping pathways to mitigation, this work establishes a strong foundation for both research and policy efforts aimed at reliable, safe deployment of language models.
Read Full Article: Here