Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Hallucination Detection on a Budget: Efficient Bayesian Estimation of Semantic Entropy
Authors: Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate empirically that our approach systematically beats the baselines, requiring only 53% of samples used by Farquhar et al. (2024) to achieve the same quality of hallucination detection as measured by AUROC. |
| Researcher Affiliation | Industry | Kamil Ciosek EMAIL Spotify Nicolò Felicioni EMAIL Spotify Sina Ghiassian EMAIL Spotify |
| Pseudocode | Yes | We summarize the ideas introduced in Sections 3.1, 3.2 and 3.3 in Algorithm 1. Algorithm 1 Estimate of Semantic Entropy for a prompt x. |
| Open Source Code | No | We will release the source code for both stages upon acceptance. |
| Open Datasets | Yes | We used the Trivia QA (Joshi et al., 2017), SQUAD (Rajpurkar et al., 2016), SVAMP (Patel et al., 2021) and NQ (Lee et al., 2019) datasets. |
| Dataset Splits | Yes | We use the first 200 prompts from each derivative dataset as the training set and the remaining 800 as the test set. |
| Hardware Specification | Yes | The computation stage that does inference in the LLM (which takes over a week on a single A100 80GB) is separated from the stage that estimates semantic entropy (which only uses the CPU, taking on the order of 12 minutes). |
| Software Dependencies | No | The paper mentions 'quantization settings' for LLMs (8 bit for Llama-3.3-70B, 16 bit for Mistral, 32 bit for Llama-3.2 and Llama-2) but does not list specific software dependencies like programming languages, libraries, or frameworks with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Following the methodology of Farquhar et al. (2024), the N LLM responses are generated with temperature 1.0. On the other hand, the LLM response about which we seek to determine if it is a hallucination is generated with temperature 0.1. |
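
The split quoted in the Dataset Splits row (first 200 prompts for training, remaining 800 for testing) can be illustrated with a minimal Python sketch. The function name `split_prompts` and the placeholder prompt list are hypothetical and are not taken from the paper, whose code is not released according to the table; the sketch only assumes that each derivative dataset is an ordered sequence of 1,000 prompts.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_prompts(prompts: Sequence[T], n_train: int = 200) -> Tuple[List[T], List[T]]:
    """Positional split: the first n_train items form the training set,
    everything after them forms the test set. No shuffling is applied."""
    if len(prompts) < n_train:
        raise ValueError("fewer prompts than the requested training size")
    return list(prompts[:n_train]), list(prompts[n_train:])


# Hypothetical stand-in for one derivative dataset of 1,000 prompts.
prompts = [f"prompt-{i}" for i in range(1000)]
train, test = split_prompts(prompts)
print(len(train), len(test))
```

Because the split is positional rather than random, it is deterministic given a fixed prompt order, so reproducing it would depend on knowing how the derivative datasets were ordered.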