Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy

Authors: Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We demonstrate empirically that our approach systematically beats the baselines, requiring only 53% of samples used by Farquhar et al. (2024) to achieve the same quality of hallucination detection as measured by AUROC.
Researcher Affiliation	Industry	Kamil Ciosek EMAIL Spotify Nicolò Felicioni EMAIL Spotify Sina Ghiassian EMAIL Spotify
Pseudocode	Yes	We summarize the ideas introduced in Sections 3.1, 3.2 and 3.3 in Algorithm 1. Algorithm 1 Estimate of Semantic Entropy for a prompt x.
Open Source Code	No	We will release the source code for both stages upon acceptance.
Open Datasets	Yes	We used the Trivia QA (Joshi et al., 2017), SQUAD (Rajpurkar et al., 2016), SVAMP (Patel et al., 2021) and NQ (Lee et al., 2019) datasets.
Dataset Splits	Yes	We use the first 200 prompts from each derivative dataset as the training set and the remaining 800 as the test set.
Hardware Specification	Yes	The computation stage that does inference in the LLM (which takes over a week on a single A100 80GB) is separated from the stage that estimates semantic entropy (which only uses the CPU, taking on the order of 12 minutes).
Software Dependencies	No	The paper mentions 'quantization settings' for LLMs (8 bit for Llama-3.3-70B, 16 bit for Mistral, 32 bit for Llama-3.2 and Llama-2) but does not list specific software dependencies like programming languages, libraries, or frameworks with version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup	Yes	Following the methodology of Farquhar et al. (2024), the N LLM responses are generated with temperature 1.0. On the other hand, the LLM response about which we seek to determine if it is a hallucination is generated with temperature 0.1.
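
The Dataset Splits row describes a positional split: the first 200 prompts of each derivative dataset form the training set and the remaining 800 form the test set. The following is a minimal sketch of such a split, not the authors' code. The function name `split_prompts`, the `n_train` parameter, and the placeholder prompt strings are assumptions made for illustration.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_prompts(prompts: Sequence[T], n_train: int = 200) -> tuple[list[T], list[T]]:
    """Split prompts by position: the first n_train items, then the rest.

    Hypothetical helper; the order of `prompts` is kept as given (no shuffling).
    """
    if len(prompts) < n_train:
        raise ValueError("need at least n_train prompts to form a training set")
    return list(prompts[:n_train]), list(prompts[n_train:])


# Placeholder data standing in for one 1000-prompt derivative dataset.
prompts = [f"prompt-{i}" for i in range(1000)]
train, test = split_prompts(prompts)
```

Because the split is positional, the result depends entirely on the order in which the derivative dataset was constructed, which the table above does not specify.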