Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models

Authors: Haolang Lu, Yilian Liu, Jingxin Xu, Guoshun Nan, Yuanlong Yu, Zhican Chen, Kun Wang

NeurIPS 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we conduct a comprehensive audit of hallucinations in Reasoning Large Language Models, revealing that ungrounded reflection and prompt-aligned bias are key drivers of false belief reinforcement in long-chain reasoning. By modeling the evolution of hallucinations under controlled knowledge settings and analyzing reflective CoT behaviors, we demonstrate that current detection and intervention methods lack the granularity and robustness needed to handle complex, multi-step hallucinations. We also explicitly state: In this paper, we conducted extensive experiments without involving theoretical numerical simulations.
Researcher Affiliation Academia Haolang Lu¹, Yilian Liu¹, Jingxin Xu¹, Guoshun Nan¹, Yuanlong Yu¹, Zhican Chen¹, Kun Wang². ¹Beijing University of Posts and Telecommunications, China; ²Nanyang Technological University, Singapore
Pseudocode Yes The pseudocode for Type I (Seen but Unlearned) Question Answer Generation is shown in Algorithm 1. A simplified sample example is shown in Figure 12. The question generation prompt is displayed in Figure 16. The pseudocode for Type I Control (Correct Answer) Question Answer Generation is shown in Algorithm 2. A simplified sample example is shown in Figure 13. The question generation prompt is shown in Figure 17. The pseudocode for Type II Question Answer Generation is shown in Algorithm 3, and a simplified example appears in Figure 14. The question generation prompt is shown in Figure 18.
Open Source Code Yes Our code is available at this link. Also, in the 'Open access to data and code' section, it states: In this paper, we provide links to both the experimental code and dataset, enabling full reproducibility of all reported results when combining the code with the provided data.
Open Datasets Yes Our Controlled Hallucination Audit Dataset, the first to audit Long-CoT hallucinations in RLLMs, primarily comprises question and reasoning-answer generation. All data synthesis was conducted under strict human oversight to ensure annotation quality. Additionally, in the 'Open access to data and code' section, it states: In this paper, we provide links to both the experimental code and dataset, enabling full reproducibility of all reported results when combining the code with the provided data.
Dataset Splits Yes The dataset is divided into four subsets: Type I (Seen but Unlearned), Type I Control (Correct Answer), Type II (Unseen or Erroneous), and Type II Control (Error Rejected). Table 1: Comparison of statistics across two types of hallucination and their respective control groups. Sample Size (Questions): 439, 500, 484, 92. Sample Size (Answers): 439 × 5, 500 × 5, 484 × 5, 92 × 5. Relevant RFCs number: 314, 50, 50, 38. In this experiment, we select 70 samples for validation, including 40 hallucination samples and 30 non-hallucination samples.
Hardware Specification Yes All experiments were conducted on a Linux server running Ubuntu 20.04.1 LTS (kernel version 5.15.0-124-generic, x86_64 architecture). The server is equipped with two Intel Xeon Gold 6248R 3.00 GHz processors (dual socket, 24 cores and 2 threads per socket, totaling 96 logical CPUs), 502 GiB of RAM, and two NVIDIA A100-SXM4-80GB GPUs. The system uses driver version 535.161.07 with CUDA 12.2.
Software Dependencies Yes The software environment includes Python 3.9, PyTorch 2.2.0, and Hugging Face Transformers 4.39.3.
Experiment Setup Yes For the Dataset Construction section, we used the DeepSeek-R1 API [39] and ChatGPT-4o API [2] to synthesize data and assist in the manual verification of samples. We evaluated performance based on DeepSeek's officially released distilled model, DeepSeek-R1-Distill-Qwen-14B [39], for the Hallucination Detection section. For each question, sample 5 independent answers, recording the RFC number and reference location for each response. For detection, we compute top-k logit entropy by normalizing the entropy over the K most probable tokens, following Sriramanan et al. [52]; we select an optimal threshold on the validation set, and then apply that threshold during testing to flag hallucinated outputs. Given a user prompt, it first generates a deterministic main response at temperature 0, then produces N stochastic samples at temperature 1.
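
The threshold-based detection step quoted under Experiment Setup can be illustrated with a minimal sketch. This is not the authors' code. It assumes that "top-k logit entropy" means the Shannon entropy of a softmax renormalized over the K largest logits and divided by log K, that a response is scored by averaging over its tokens, that higher entropy indicates hallucination, and that the "optimal" threshold is the one maximizing validation accuracy. The function names `topk_entropy`, `response_score`, and `select_threshold` are hypothetical.

```python
import math
from typing import List, Sequence


def topk_entropy(logits: Sequence[float], k: int = 10) -> float:
    """Entropy of the softmax over the k largest logits, scaled to [0, 1].

    Assumption: normalization means dividing by log(k), the maximum
    entropy of a distribution over k outcomes.
    """
    top = sorted(logits, reverse=True)[:k]
    if len(top) < 2:
        return 0.0  # a single candidate carries no uncertainty
    m = top[0]  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in top]
    z = sum(exps)
    probs = [e / z for e in exps]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(top))


def response_score(token_logits: Sequence[Sequence[float]], k: int = 10) -> float:
    """Mean top-k entropy over all generated tokens (an assumed aggregation)."""
    return sum(topk_entropy(l, k) for l in token_logits) / len(token_logits)


def select_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Return the cut-off that maximizes accuracy on the validation set.

    A response is flagged as hallucinated when score >= threshold.
    Candidate thresholds are the observed validation scores themselves;
    ties are resolved in favor of the smallest candidate.
    """
    best_t, best_acc = float("inf"), -1.0
    for t in sorted(set(scores)):
        correct = sum((s >= t) == y for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def flag_hallucinations(scores: Sequence[float], threshold: float) -> List[bool]:
    """Apply the validation-selected threshold to test-time scores."""
    return [s >= threshold for s in scores]
```

In the setting described above, `select_threshold` would be fitted on the 70 validation samples (40 hallucination, 30 non-hallucination) and the resulting value passed unchanged to `flag_hallucinations` on the test responses. Accuracy is only one possible selection criterion; F1 or Youden's J would fit the same interface.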