Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Robust Hallucination Detection in LLMs via Adaptive Token Selection

Authors: Mengjia Niu, Hamed Haddadi, Guansong Pang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive experimental results on four hallucination benchmarks show that HaMI significantly outperforms existing state-of-the-art approaches. Code is available at https://github.com/mala-lab/HaMI. ... 5 Experiments Datasets and Models. We evaluate our method on four popular benchmark datasets across a range of question-answering (QA) domains...
Researcher Affiliation	Academia	Mengjia Niu (1,2), Hamed Haddadi (1), Guansong Pang (2); (1) Imperial College London, UK; (2) Singapore Management University, Singapore. EMAIL, EMAIL, EMAIL
Pseudocode	No	The paper describes the methodology in Section 4, 'Methodology', and presents a framework in Figure 2. However, there is no explicitly labeled 'Pseudocode' or 'Algorithm' section or block.
Open Source Code	Yes	Code is available at https://github.com/mala-lab/HaMI.
Open Datasets	Yes	We evaluate our method on four popular benchmark datasets across a range of question-answering (QA) domains, including (1) TriviaQA [24]... (2) Stanford Question Answering Dataset (SQuAD for short) [41]... (3) Natural Questions (denoted as NQ) [28]... and (4) a biomedical QA corpus BioASQ [27].
Dataset Splits	Yes	For each dataset, we randomly extract 2,000 QA pairs for training and 800 pairs for testing. ... The refined set from the remaining 300 QA pairs is used as a validation set for selecting the optimal layer for representation extraction. ... the final sizes of the training and testing sets vary around 1,900 and 400, respectively.
Hardware Specification	Yes	We implement our method using PyTorch 2.6.0 [39] and transformers 4.51.3 [48] and conduct all experiments on NVIDIA A100 GPUs.
Software Dependencies	Yes	We implement our method using PyTorch 2.6.0 [39] and transformers 4.51.3 [48] and conduct all experiments on NVIDIA A100 GPUs.
Experiment Setup	Yes	For the main results, our hallucination detector f_θ(·) utilises a two-layer MLP with a hidden dimension of 256. The first linear layer is followed by BatchNorm and ReLU activations and the second linear layer is accompanied by a sigmoid output. ... At the training stage, we use Adam optimiser. ... Unless otherwise specified, the hidden dimension of the two-layer MLP is set to 256. k in Eq. 1 is dynamically determined by the generated token length (l), defined as k = 0.1 l + 1.
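
The two numeric rules quoted in the Dataset Splits and Experiment Setup rows can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the fixed seed, the disjoint train/test sampling, and the flooring of 0.1 l to an integer are all assumptions made for illustration, since the quoted excerpts do not specify them.

```python
import random


def adaptive_k(token_length: int) -> int:
    """Return k = 0.1 * l + 1 for a generated token length l.

    Assumption: the 0.1 * l term is floored so that k is an integer;
    the quoted excerpt does not state the rounding rule.
    """
    if token_length < 1:
        raise ValueError("token_length must be at least 1")
    # Integer division by 10 is an exact floor of 0.1 * l.
    return token_length // 10 + 1


def split_qa_pairs(pairs, n_train=2000, n_test=800, seed=0):
    """Randomly draw training and testing QA pairs from one dataset.

    Assumptions: the two subsets are disjoint and the draw is seeded;
    neither detail is given in the quoted excerpt.
    """
    if len(pairs) < n_train + n_test:
        raise ValueError("not enough QA pairs for the requested split")
    rng = random.Random(seed)
    # Sample without replacement, then cut the sample into two parts.
    sampled = rng.sample(pairs, n_train + n_test)
    return sampled[:n_train], sampled[n_train:]


if __name__ == "__main__":
    # Hypothetical stand-in for a list of QA pairs.
    train, test = split_qa_pairs(list(range(3100)))
    print(len(train), len(test))
    print(adaptive_k(25))
```

Under these assumptions, a 25-token generation gives k = 3, and k grows by one for every ten additional tokens. The split sizes here are the pre-refinement figures; the excerpt notes that the final sets are smaller after refinement.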