Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Robust Hallucination Detection in LLMs via Adaptive Token Selection

Authors: Mengjia Niu, Hamed Haddadi, Guansong Pang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experimental results on four hallucination benchmarks show that HaMI significantly outperforms existing state-of-the-art approaches. Code is available at https://github.com/mala-lab/HaMI. ... 5 Experiments. Datasets and Models. We evaluate our method on four popular benchmark datasets across a range of question-answering (QA) domains...
Researcher Affiliation | Academia | Mengjia Niu¹,², Hamed Haddadi¹, Guansong Pang²; ¹Imperial College London, UK; ²Singapore Management University, Singapore. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology in Section 4, 'Methodology', and presents a framework in Figure 2. However, there is no explicitly labeled 'Pseudocode' or 'Algorithm' section or block.
Open Source Code | Yes | Code is available at https://github.com/mala-lab/HaMI.
Open Datasets | Yes | We evaluate our method on four popular benchmark datasets across a range of question-answering (QA) domains, including (1) TriviaQA [24]... (2) Stanford Question Answering Dataset (SQuAD for short) [41]... (3) Natural Questions (denoted as NQ) [28]... and (4) a biomedical QA corpus BioASQ [27].
Dataset Splits | Yes | For each dataset, we randomly extract 2,000 QA pairs for training and 800 pairs for testing. ... The refined set from the remaining 300 QA pairs is used as a validation set for selecting the optimal layer for representation extraction. ... the final sizes of the training and testing sets vary around 1,900 and 400, respectively.
Hardware Specification | Yes | We implement our method using PyTorch 2.6.0 [39] and transformers 4.51.3 [48] and conduct all experiments on NVIDIA A100 GPUs.
Software Dependencies | Yes | We implement our method using PyTorch 2.6.0 [39] and transformers 4.51.3 [48] and conduct all experiments on NVIDIA A100 GPUs.
Experiment Setup | Yes | For the main results, our hallucination detector fθ(·) utilises a two-layer MLP with a hidden dimension of 256. The first linear layer is followed by a BatchNorm and ReLU activations and the second linear layer is accompanied by a sigmoid output. ... At the training stage, we use Adam optimiser. ... Unless otherwise specified, the hidden dimension of the two-layer MLP is set to 256. k in Eq. 1 is dynamically determined by the generated token length (l), defined as k = 0.1 l + 1.
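As a small worked example of the rule quoted in the Experiment Setup row, the sketch below computes k from the generated token length l. It rests on one assumption not stated in the excerpt: that the product 0.1 l is rounded down so that k is an integer. The function name `num_selected_tokens` is hypothetical and is not taken from the paper or its repository.

```python
def num_selected_tokens(token_length: int) -> int:
    """Return k for a generation of `token_length` tokens.

    Assumption: k = floor(0.1 * l) + 1. The excerpt only gives
    k = 0.1 l + 1, so the rounding rule here is a guess.
    """
    if token_length < 0:
        raise ValueError("token_length must be non-negative")
    # Integer division by 10 equals floor(0.1 * l) without float rounding.
    return token_length // 10 + 1


if __name__ == "__main__":
    # Show how k grows with the length of the generated answer.
    for length in (5, 10, 25, 100):
        print(length, num_selected_tokens(length))
```

Under this reading, the "+ 1" term guarantees that at least one token is selected even for answers shorter than ten tokens.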