Kernel Language Entropy: Fine-grained Uncertainty Quantification for LLMs from Semantic Similarities
Authors: Alexander Nikitin, Jannik Kossen, Yarin Gal, Pekka Marttinen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically compare our approach against baseline methods across several tasks and LLMs with up to 70B parameters (60 scenarios total), achieving SoTA results (Sec. 5). Additionally, Section 5 is titled 'Experiments'. |
| Researcher Affiliation | Academia | Alexander Nikitin (1), Jannik Kossen (2), Yarin Gal (2), Pekka Marttinen (1); (1) Department of Computer Science, Aalto University; (2) OATML, Department of Computer Science, University of Oxford |
| Pseudocode | Yes | Algorithm 1 Kernel Language Entropy (see the sketch after the table) |
| Open Source Code | Yes | We release the code and instructions for reproducing our results at https://github.com/AlexanderVNikitin/kernel-language-entropy. |
| Open Datasets | Yes | We evaluate our method on the following tasks covering different domains of natural language generation: general knowledge (TriviaQA [29] and SQuAD [62]), biology and medicine (BioASQ [35]), general domain questions from Google search (Natural Questions, NQ [38]), and natural language math problems (SVAMP [60]). |
| Dataset Splits | Yes | We compare the strategies of hyperparameter selection from Sec. 3.2: entropy convergence plots and validation sets (100 samples per dataset except for SVAMP, where we used default hyperparameters). |
| Hardware Specification | Yes | We ran Llama 2 70B models on two NVIDIA A100 80GB GPUs, and the rest of the models on a single NVIDIA A100 80GB. |
| Software Dependencies | No | The paper mentions using 'DeBERTa-Large-MNLI [22]' as the NLI model, but it does not specify exact version numbers for this or any other software dependencies like Python, PyTorch, or other libraries. |
| Experiment Setup | Yes | Sampling. We sample 10 answers per input via top-K sampling with K = 50 and nucleus sampling with p = 0.9 at temperature T = 1. To assess model accuracy, we draw an additional low-temperature sample (T = 0.1)... (see the sampling sketch after the table) |
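For context on the Pseudocode row: Algorithm 1 in the paper computes the von Neumann entropy of a unit-trace semantic kernel built over sampled answers, with pairwise semantics scored by the DeBERTa-Large-MNLI model mentioned in the Software Dependencies row. Below is a minimal sketch, not the authors' implementation: the symmetrized-entailment kernel and all function names here are illustrative assumptions, since the paper derives its kernels more carefully (e.g., from semantic graphs) so that they are positive semi-definite by construction.

```python
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# NLI model named in the paper; for this checkpoint the label order is
# (contradiction, neutral, entailment).
NLI_NAME = "microsoft/deberta-large-mnli"
nli_tokenizer = AutoTokenizer.from_pretrained(NLI_NAME)
nli_model = AutoModelForSequenceClassification.from_pretrained(NLI_NAME)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailment) for a premise/hypothesis pair under the NLI model."""
    inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 2].item()

def kernel_language_entropy(answers: list[str]) -> float:
    """Von Neumann entropy of a unit-trace semantic kernel over sampled answers.

    The symmetrized entailment matrix used here is only a stand-in kernel;
    unlike the paper's kernels, it is not guaranteed positive semi-definite.
    """
    n = len(answers)
    K = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim = 0.5 * (entailment_prob(answers[i], answers[j])
                         + entailment_prob(answers[j], answers[i]))
            K[i, j] = K[j, i] = sim
    rho = K / np.trace(K)               # normalize to unit trace
    eigvals = np.linalg.eigvalsh(rho)   # K is symmetric
    eigvals = eigvals[eigvals > 1e-12]  # drop zero/negative eigenvalues
    return float(-(eigvals * np.log(eigvals)).sum())  # -Tr(rho log rho)
```

Intuitively, high entropy means the sampled answers split into many semantically distinct groups (high uncertainty), while low entropy means they largely agree.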
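Likewise, the sampling configuration quoted in the Experiment Setup row maps directly onto standard Hugging Face `generate` arguments. A minimal sketch under that reading follows; the model name and prompt are placeholders, not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; the paper evaluates models up to 70B
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Question: Who wrote 'Pride and Prejudice'?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 10 samples per input via top-K (K=50) and nucleus (p=0.9) sampling at T=1,
# as quoted in the Experiment Setup row.
samples = model.generate(**inputs, do_sample=True, top_k=50, top_p=0.9,
                         temperature=1.0, num_return_sequences=10,
                         max_new_tokens=64)

# One additional low-temperature sample (T=0.1) to assess model accuracy.
accuracy_sample = model.generate(**inputs, do_sample=True, top_k=50, top_p=0.9,
                                 temperature=0.1, max_new_tokens=64)

# Strip the prompt tokens before decoding the sampled answers.
answers = tokenizer.batch_decode(samples[:, inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
```

The 10 high-temperature samples would feed an uncertainty estimator such as the `kernel_language_entropy` sketch above, while the low-temperature sample serves as the answer whose accuracy is graded.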