Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Why and How LLMs Hallucinate: Connecting the Dots with Subsequence Associations

Authors: Yiyou Sun, Yu Gai, Lijie Chen, Abhilasha Ravichander, Yejin Choi, Nouha Dziri, Dawn Song

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show our method outperforms standard attribution techniques in identifying hallucination causes and aligns with evidence from the model's training corpus. This work provides a unified perspective on hallucinations and a robust framework for their tracing and analysis.
Researcher Affiliation | Academia | 1 University of California, Berkeley; 2 Max Planck Institute; 3 Stanford University; 4 AI2
Pseudocode | Yes | Algorithm 1: Masking and Refill; Algorithm 2: Beam Search-based Trace Subsequence Association
Open Source Code | Yes | Our code is available at https://github.com/sunyiyou/SAT.git.
Open Datasets | Yes | To evaluate the performance of our methodology in tracing hallucinations, we require a benchmark dataset where the hallucinated subsequences $\tilde{o}^h$ are explicitly identified. HALoGEN [44] offers a comprehensive benchmark... The training corpus consists of approximately 400M documents, each with 4096 tokens. For each test input subsequence trigger $\tilde{s}^h$ and the hallucinated output subsequence $\tilde{o}^h$, we compute the conditional probability $P_{\mathrm{doc} \in \mathrm{Dolma}}((\tilde{s}^h, \tilde{o}^h) \sqsubseteq \mathrm{doc}) / P_{\mathrm{doc} \in \mathrm{Dolma}}(\tilde{s}^h \sqsubseteq \mathrm{doc})$ directly from the dataset and compare it to the reproducibility rate ($S_{rep}$) tested on Olmo. Figure 6 illustrates this comparison, showing the Pearson correlation coefficients ρ for various test domains, where reproducibility rates are overall strongly correlated (ρ = 0.72).
Dataset Splits | Yes | In this study, we focus on six domains from HALoGEN that utilize programmatic verification to identify $\tilde{o}^h$. This selection ensures that the identified hallucinations are objective and independent of LLM-based judgments. The chosen domains and their abbreviations are as follows: Code Package Imports (CODE), Scientific Attribution (REF), Biographies (BIO), False Presuppositions (FP), Rationalization Numerical (R-NUM), and two distinct Rationalization Binary domains, one based on prime factorization (R-PRM) and the other on knowledge about U.S. senators (R-SEN). We provide more prompt details in Appendix C.2. ... To facilitate analysis, we subsample 100 prompts from each domain, selecting those with the most frequently occurring hallucinated subsequence.
Hardware Specification | Yes | Our computation ran on a machine with 8 H100. Each setting of experiment can be finished in 48 hours.
Software Dependencies | No | The paper mentions using 'Inseq: An interpretability toolkit for sequence generation models.' [51] for baseline implementations, but does not provide specific version numbers for this or any other software dependencies.
Experiment Setup | Yes | The approximation corpus, sampled using the algorithm described in Section 3, is set to a size of $\lvert \hat{P}_s \rvert = 512$. During the generation process, tokens are replaced only within the queries, leaving the system prompts and chat template unchanged. The greedy search beam size B is set to 20. For each $\hat{P} \in \{\text{bert}, \text{rand}, \text{gpt-m}, \text{gpt-t}\}$ in the evaluation, we use $\lvert \hat{P} \rvert = 25$. A hyperparameter sensitivity analysis can be found in Appendix C.3. ... For a fair comparison, we report $\tilde{s}$ at the same length for all baseline methods with a fixed ratio of the original sequence length, $\lvert \tilde{s} \rvert = \alpha \lvert s \rvert$. The results for α = 0.5 are presented in Table 4.1, while additional ratios (α = 0.25 and α = 0.75) are provided in Figure 4.
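
The Open Datasets entry above describes comparing a corpus co-occurrence ratio with the reproducibility rate via a Pearson correlation. The following is a minimal sketch of that kind of computation on a toy corpus, not the authors' implementation. All names (`is_subsequence`, `cond_cooccurrence`, `pearson`, `toy_docs`) are hypothetical, and it assumes that "subsequence" means tokens appearing in order but not necessarily adjacent, and that the trigger and the output are each checked against the document independently.

```python
from math import sqrt


def is_subsequence(sub, seq):
    """Return True if the tokens of `sub` occur in `seq` in order (gaps allowed)."""
    it = iter(seq)
    # `tok in it` advances the iterator, so order is enforced.
    return all(tok in it for tok in sub)


def cond_cooccurrence(trigger, output, docs):
    """Fraction of documents containing `trigger` that also contain `output`.

    Assumption: this stands in for the ratio
    P((trigger, output) in doc) / P(trigger in doc) over the corpus.
    """
    with_trigger = [d for d in docs if is_subsequence(trigger, d)]
    if not with_trigger:
        return 0.0  # trigger never observed; the ratio is undefined, report 0
    both = sum(1 for d in with_trigger if is_subsequence(output, d))
    return both / len(with_trigger)


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists with non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Toy tokenized corpus (illustrative only, not Dolma).
toy_docs = [
    ["import", "numpy", "as", "np"],
    ["import", "pandas", "as", "pd"],
    ["import", "numpy", "linalg"],
    ["print", "hello"],
]

if __name__ == "__main__":
    ratio = cond_cooccurrence(["import"], ["numpy"], toy_docs)
    print(f"co-occurrence ratio: {ratio:.3f}")
```

In practice one ratio would be computed per (trigger, output) pair, and the resulting list would be passed to `pearson` together with the measured reproducibility rates. At the scale reported above (about 400M documents) a linear scan like this would be impractical, so an index over the corpus would likely be needed.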