Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities
Authors: Alexander Nikitin, Jannik Kossen, Yarin Gal, Pekka Marttinen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically compare our approach against baselines methods across several tasks and LLMs with up to 70B parameters (60 scenarios total), achieving SoTA results (Sec. 5). Additionally, Section 5 is titled 'Experiments'. |
| Researcher Affiliation | Academia | Alexander Nikitin¹, Jannik Kossen², Yarin Gal², Pekka Marttinen¹; ¹Department of Computer Science, Aalto University; ²OATML, Department of Computer Science, University of Oxford |
| Pseudocode | Yes | Algorithm 1 Kernel Language Entropy |
| Open Source Code | Yes | We release the code and instructions for reproducing our results at https://github.com/AlexanderVNikitin/kernel-language-entropy. |
| Open Datasets | Yes | We evaluate our method on the following tasks covering different domains of natural language generation: general knowledge (TriviaQA [29] and SQuAD [62]), biology and medicine (BioASQ [35]), general domain questions from Google search (Natural Questions, NQ [38]), and natural language math problems (SVAMP [60]). |
| Dataset Splits | Yes | We compare the strategies of hyperparameter selection from Sec. 3.2: entropy convergence plots and validation sets (100 samples per dataset except for SVAMP, where we used default hyperparameters). |
| Hardware Specification | Yes | We ran Llama 2 70B models on two NVIDIA A100 80GB GPUs, and the rest of the models on a single NVIDIA A100 80GB. |
| Software Dependencies | No | The paper mentions using 'DeBERTa-Large-MNLI [22]' as the NLI model, but it does not specify exact version numbers for this or any other software dependencies like Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | Sampling. We sample 10 answers per input via top-K sampling with K = 50 and nucleus sampling with p = 0.9 at temperature T = 1. To assess model accuracy, we draw an additional low-temperature sample (T = 0.1)... |
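
The sampling configuration quoted in the Experiment Setup row (top-K with K = 50, nucleus sampling with p = 0.9, temperature T = 1, 10 answers per input) can be illustrated with a minimal, standard-library sketch. This is not the authors' code: the function names, the toy logits, and the order of operations (temperature, then top-K, then nucleus over the renormalised top-K distribution) are assumptions made for illustration only.

```python
import math
import random


def filter_top_k_top_p(logits, k=50, p=0.9, temperature=1.0):
    """Return {token_index: probability} after temperature, top-K and nucleus filtering.

    Hypothetical helper; the ordering of the filters is an assumption.
    """
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-K: keep the k most probable tokens and renormalise over them.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top_mass = sum(probs[i] for i in ranked)
    top_probs = {i: probs[i] / top_mass for i in ranked}

    # Nucleus: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += top_probs[i]
        if cumulative >= p:
            break

    # Renormalise over the surviving tokens.
    kept_mass = sum(top_probs[i] for i in kept)
    return {i: top_probs[i] / kept_mass for i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_top_k_top_p(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


rng = random.Random(0)
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up next-token logits

# Ten draws at T = 1, mirroring the "10 answers per input" setting.
answers = [sample_token(toy_logits, rng, k=50, p=0.9, temperature=1.0) for _ in range(10)]

# One additional low-temperature draw (T = 0.1), as used to assess accuracy.
low_temp_answer = sample_token(toy_logits, rng, k=50, p=0.9, temperature=0.1)
```

The sketch operates on a single next-token distribution; a real run would apply the same filtering at every decoding step of the language model.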