reproducibilityindex.ai

Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities

Authors: Alexander Nikitin, Jannik Kossen, Yarin Gal, Pekka Marttinen

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We empirically compare our approach against baseline methods across several tasks and LLMs with up to 70B parameters (60 scenarios total), achieving SoTA results (Sec. 5). Additionally, Section 5 is titled 'Experiments'.
Researcher Affiliation	Academia	Alexander Nikitin¹, Jannik Kossen², Yarin Gal², Pekka Marttinen¹. ¹ Department of Computer Science, Aalto University; ² OATML, Department of Computer Science, University of Oxford
Pseudocode	Yes	Algorithm 1 Kernel Language Entropy
Open Source Code	Yes	We release the code and instructions for reproducing our results at https://github.com/AlexanderVNikitin/kernel-language-entropy.
Open Datasets	Yes	We evaluate our method on the following tasks covering different domains of natural language generation: general knowledge (TriviaQA [29] and SQuAD [62]), biology and medicine (BioASQ [35]), general domain questions from Google search (Natural Questions, NQ [38]), and natural language math problems (SVAMP [60]).
Dataset Splits	Yes	We compare the strategies of hyperparameter selection from Sec. 3.2: entropy convergence plots and validation sets (100 samples per dataset except for SVAMP, where we used default hyperparameters).
Hardware Specification	Yes	We ran Llama 2 70B models on two NVIDIA A100 80GB GPUs, and the rest of the models on a single NVIDIA A100 80GB.
Software Dependencies	No	The paper mentions using 'DeBERTa-Large-MNLI [22]' as the NLI model, but it does not specify exact version numbers for this or any other software dependencies like Python, PyTorch, or other libraries.
Experiment Setup	Yes	Sampling. We sample 10 answers per input via top-K sampling with K = 50 and nucleus sampling with p = 0.9 at temperature T = 1. To assess model accuracy, we draw an additional low-temperature sample (T = 0.1)...
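
The sampling settings quoted in the Experiment Setup row can be illustrated with a small, dependency-free sketch. This is not the authors' code. The constants (10 samples, K = 50, p = 0.9, T = 1, and T = 0.1 for the accuracy sample) come from the quoted text. The function names, the toy logits, the order in which the filters are applied (temperature, then top-K, then nucleus), and the reuse of the same K and p for the low-temperature sample are assumptions made for illustration.

```python
import math
import random

# Values quoted in the Experiment Setup row.
NUM_SAMPLES = 10
TOP_K = 50
TOP_P = 0.9
TEMPERATURE = 1.0
ACCURACY_TEMPERATURE = 0.1  # the additional low-temperature sample


def filtered_distribution(logits, temperature, top_k, top_p):
    """Return {token_id: prob} after temperature scaling, top-K, then nucleus filtering.

    The filter order is an assumption of this sketch, not taken from the paper.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-K: keep the K most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    top_k_mass = sum(probs[i] for i in ranked)

    # Nucleus: keep the smallest prefix whose renormalized mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / top_k_mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


def sample_tokens(logits, n, temperature, rng):
    """Draw n token ids from the filtered distribution (hypothetical helper)."""
    dist = filtered_distribution(logits, temperature, TOP_K, TOP_P)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=n)


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits
    rng = random.Random(0)
    print(sample_tokens(toy_logits, NUM_SAMPLES, TEMPERATURE, rng))
    print(sample_tokens(toy_logits, 1, ACCURACY_TEMPERATURE, rng))
```

With the toy logits, the nucleus filter at T = 1 removes only the least likely token, while at T = 0.1 the distribution collapses onto the single most likely token, which is why a low-temperature sample is a reasonable proxy for the model's best answer when assessing accuracy.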