Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities

Authors: Alexander Nikitin, Jannik Kossen, Yarin Gal, Pekka Marttinen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically compare our approach against baseline methods across several tasks and LLMs with up to 70B parameters (60 scenarios total), achieving SoTA results (Sec. 5). Additionally, Section 5 is titled 'Experiments'.
Researcher Affiliation | Academia | Alexander Nikitin¹, Jannik Kossen², Yarin Gal², Pekka Marttinen¹; ¹Department of Computer Science, Aalto University; ²OATML, Department of Computer Science, University of Oxford
Pseudocode | Yes | Algorithm 1: Kernel Language Entropy (see the first sketch below the table)
Open Source Code | Yes | We release the code and instructions for reproducing our results at https://github.com/AlexanderVNikitin/kernel-language-entropy.
Open Datasets | Yes | We evaluate our method on the following tasks covering different domains of natural language generation: general knowledge (TriviaQA [29] and SQuAD [62]), biology and medicine (BioASQ [35]), general domain questions from Google search (Natural Questions, NQ [38]), and natural language math problems (SVAMP [60]). (See the dataset-loading sketch below the table.)
Dataset Splits | Yes | We compare the strategies of hyperparameter selection from Sec. 3.2: entropy convergence plots and validation sets (100 samples per dataset, except for SVAMP, where we used default hyperparameters).
Hardware Specification | Yes | We ran Llama 2 70B models on two NVIDIA A100 80GB GPUs, and the rest of the models on a single NVIDIA A100 80GB.
Software Dependencies | No | The paper mentions using 'DeBERTa-Large-MNLI [22]' as the NLI model, but it does not specify exact version numbers for this or any other software dependencies like Python, PyTorch, or other libraries. (See the NLI sketch below the table.)
Experiment Setup | Yes | Sampling. We sample 10 answers per input via top-K sampling with K = 50 and nucleus sampling with p = 0.9 at temperature T = 1. To assess model accuracy, we draw an additional low-temperature sample (T = 0.1)... (See the sampling sketch below the table.)
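
The 'Pseudocode' row points to Algorithm 1 (Kernel Language Entropy). As a rough illustration only, the following minimal Python sketch scores uncertainty as the von Neumann entropy of a unit-trace heat kernel built from pairwise semantic similarities between sampled answers. The heat-kernel construction and the diffusion time `t` are assumptions consistent with the paper's description, not the authors' released implementation (see their repository for that).

```python
import numpy as np
from scipy.linalg import expm

def von_neumann_entropy(K: np.ndarray) -> float:
    """VNE(K) = -Tr(K log K), computed from the eigenvalues of a
    unit-trace positive semi-definite kernel."""
    eigvals = np.linalg.eigvalsh(K)
    eigvals = eigvals[eigvals > 1e-12]  # 0 * log 0 = 0 by convention
    return float(-np.sum(eigvals * np.log(eigvals)))

def kernel_language_entropy(similarities: np.ndarray, t: float = 0.3) -> float:
    """Sketch of the Algorithm 1 idea: build a heat kernel over the
    semantic graph of sampled answers, normalize it to unit trace,
    and score uncertainty via the von Neumann entropy.
    `t` is an assumed diffusion-time hyperparameter."""
    W = (similarities + similarities.T) / 2   # symmetrized semantic graph
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian
    K = expm(-t * L)                          # heat kernel (PSD)
    K /= np.trace(K)                          # density-matrix normalization
    return von_neumann_entropy(K)

# Three sampled answers: two semantically close, one dissimilar.
sims = np.array([[1.0, 0.9, 0.1],
                 [0.9, 1.0, 0.1],
                 [0.1, 0.1, 1.0]])
print(kernel_language_entropy(sims))
```

Intuitively, the closer the kernel is to rank one (all answers semantically equivalent), the lower the entropy; a near-uniform spectrum (mutually dissimilar answers) yields high entropy.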
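The 'Open Datasets' row names the evaluation datasets. A minimal sketch of loading two of them via Hugging Face `datasets`; the loader and configuration names here are assumptions, since the paper does not state which loaders or splits were used:

```python
from datasets import load_dataset

# Assumed Hugging Face dataset identifiers for two of the named tasks.
trivia_qa = load_dataset("trivia_qa", "rc.nocontext", split="validation")
squad = load_dataset("squad", split="validation")

print(trivia_qa[0]["question"])
print(squad[0]["question"])
```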
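The 'Software Dependencies' row notes that the NLI model is named but unversioned. For concreteness, here is a minimal sketch of pairwise entailment scoring with the Hugging Face checkpoint `microsoft/deberta-large-mnli`; the checkpoint choice and the pre/post-processing are assumptions, as the paper cites the model family but not an exact version:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint matching the 'DeBERTa-Large-MNLI' citation.
name = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that `premise` entails `hypothesis` under the
    MNLI head (labels: 0=contradiction, 1=neutral, 2=entailment)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 2].item()

print(entailment_prob("Paris is the capital of France.",
                      "France's capital is Paris."))
```

Scores like this, computed over all pairs of sampled answers, would populate the similarity matrix consumed by the entropy sketch above.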
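The 'Experiment Setup' row fully specifies the sampling configuration. A minimal sketch of that configuration with Hugging Face `transformers`; `top_k`, `top_p`, the temperatures, and the number of samples follow the quoted setup, while the checkpoint and `max_new_tokens` are illustrative assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; the paper evaluates several LLMs up to 70B parameters.
name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

question = "Who wrote 'The Old Man and the Sea'?"
inputs = tok(question, return_tensors="pt")

# 10 answers per input: top-K (K=50) plus nucleus (p=0.9) sampling at T=1.
samples = model.generate(**inputs, do_sample=True, top_k=50, top_p=0.9,
                         temperature=1.0, num_return_sequences=10,
                         max_new_tokens=64)
# One additional low-temperature (T=0.1) sample to assess accuracy.
accuracy_sample = model.generate(**inputs, do_sample=True, top_k=50,
                                 top_p=0.9, temperature=0.1,
                                 max_new_tokens=64)
print(tok.batch_decode(samples, skip_special_tokens=True))
```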