Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Variational Uncertainty Decomposition for In-Context Learning

Authors: I. Shavindra Jayasekera, Jacob Si, Filippo Valdettaro, Wenlong Chen, Aldo A Faisal, Yingzhen Li

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through experiments on synthetic and real-world tasks, we show quantitatively and qualitatively that the decomposed uncertainties obtained from our method exhibit desirable properties of epistemic and aleatoric uncertainty. Code is available at: https://github.com/jacobyhsi/VUD. 1 Introduction ... 5 Experiments We evaluate the robustness and applicability of our method to classification and regression tasks. This includes ablation studies and visualisations on synthetic datasets, as well as downstream applications such as bandit problems and out-of-distribution (OOD) detection on question-answering (QA) tasks.
Researcher Affiliation | Academia | I. Shavindra Jayasekera, Jacob Si, Filippo Valdettaro, Wenlong Chen, A. Aldo Faisal, Yingzhen Li Imperial College London EMAIL
Pseudocode | Yes | E Algorithms and Pseudocode E.1 Pseudocode for Variational Uncertainty Decomposition Algorithm Algorithm 1 Multi-Class Classification for Aleatoric Uncertainty Estimation ... Algorithm 2 Regression for Aleatoric Uncertainty Estimation ... Algorithm 3 Compute Permutation Invariant Classification Distribution z : CLASSDIST ... Algorithm 4 Approximate Permutation Invariant Regression Distribution: REGDIST. ... Algorithm 5 Approximate Marginalisation of Mixture Distributions: NORMAPPROX.
Open Source Code | Yes | Code is available at: https://github.com/jacobyhsi/VUD.
Open Datasets | Yes | We apply LLM-abstention to binary classification datasets: BoolQ [15], HotpotQA [116], and PubMedQA [42]; as well as a multiclass classification dataset: MMLU [31].
Dataset Splits | Yes | In our main experiments, we leverage BoolQ [15], HotpotQA [116], and PubMedQA [42] interchangeably of equivalent sample size as the in-distribution (ID) and out-of-distribution (OOD) datasets [67]. We formulate these datasets as binary classification tasks (yes/no). For our reference baseline, we extend the Deep Ensembles framework [39] to our OOD detection task by ensembling the output distributions of multiple different in-context example sets. For both methods, we leverage a training set size of |D| = 15 ICL samples and a test set size of |x_ID + x_OOD| = 120 for our ID and OOD samples and average our experimental results across 3 seeds. For our method, we generate |Z| = 20 perturbations by prompting the LLM to rephrase with relevant context from the test sample. For Deep Ensembles, we leverage 5 different in-context learning sets.
Hardware Specification | Yes | G.1 Code Implementation The following delineates the foundation of our experiments: Codebase: Python & PyTorch CPU: AMD EPYC 7443P GPU: NVIDIA A6000 48GB
Software Dependencies | No | G.1 Code Implementation The following delineates the foundation of our experiments: Codebase: Python & PyTorch CPU: AMD EPYC 7443P GPU: NVIDIA A6000 48GB
Experiment Setup | Yes | We use the following LLMs in our experiments: Qwen2.5-14B/7B [88] and Llama-3.1-8B [102]. Only for QA tasks, we use Qwen2.5-14B-Instruct. ... Prompts and sampling details are provided in Appendix H. ... Temperature: 1.0 Log Probs: 10 Max Tokens: 10 (Qwen2.5-14B/7B and Llama-3.1-8B), 512 (Qwen2.5-14B-Instruct) ... In our experiments, we choose α = 2, 5. In UCB1 smaller choices of α are typically chosen [52], however this is primarily due to the slow decay of U_t(a) in the UCB1 algorithm. The decrease in epistemic uncertainty with the number of trials is significantly faster, and therefore, we use higher α.
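
The role of the exploration weight α quoted in the Experiment Setup entry can be illustrated with a small sketch. The snippet contrasts the standard UCB1 bonus, sqrt(2 ln t / n), with a generic rule that scores each arm as its mean reward plus α times an epistemic-uncertainty estimate. The function names, the example numbers, and the exact form of the epistemic term are assumptions made for illustration; they are not taken from the paper or from the VUD repository.

```python
import math
from typing import Sequence


def ucb1_bonus(t: int, n_pulls: int) -> float:
    """Standard UCB1 exploration bonus sqrt(2 ln t / n) for an arm pulled n times by round t."""
    if n_pulls == 0:
        return math.inf  # unpulled arms are always tried first
    return math.sqrt(2.0 * math.log(t) / n_pulls)


def select_arm(means: Sequence[float], epistemic: Sequence[float], alpha: float) -> int:
    """Pick the arm maximising mean + alpha * epistemic uncertainty.

    `epistemic` is a hypothetical per-arm uncertainty estimate; how it is
    computed is outside the scope of this sketch.
    """
    scores = [m + alpha * u for m, u in zip(means, epistemic)]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Made-up numbers: arm 1 has a lower mean but a larger uncertainty estimate.
    means = [0.6, 0.5]
    epistemic = [0.01, 0.04]
    for alpha in (2.0, 5.0):  # the two alpha values quoted in the setup
        print(alpha, select_arm(means, epistemic, alpha))
    # The UCB1 bonus shrinks only slowly as an arm accumulates pulls.
    print(ucb1_bonus(100, 10), ucb1_bonus(1000, 100))
```

With these made-up numbers, α = 2 selects arm 0 and α = 5 selects arm 1, which shows why a small uncertainty term needs a larger weight to influence exploration.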