Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Efficient semantic uncertainty quantification in language models via diversity-steered sampling

Authors: Ji Won Park, Kyunghyun Cho

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our experiments evaluate the method's ability to quantify established proxies for aleatoric and epistemic uncertainties, demonstrating improved uncertainty estimation across diverse NLP tasks. Practical enhancements, including adaptive tuning of the diversity hyperparameter and online stopping based on estimator stability, further improve sample efficiency.
Researcher Affiliation	Collaboration	Ji Won Park (Prescient Design, Genentech; EMAIL); Kyunghyun Cho (Prescient Design, Genentech; Center for Data Science, New York University; EMAIL)
Pseudocode	Yes	Algorithm 1: Diversity-steered autoregressive sampling; Algorithm 2: Diversity-steered masked diffusion sampling
Open Source Code	Yes	Our sampling pipeline is implemented at https://github.com/jiwoncpark/diversity_steered_sampling.
Open Datasets	Yes	We perform experiments on four question-answering (QA) benchmark datasets covering both closed- and open-book tasks: 907 validation matched instances with shorter stories from CoQA [62], a closed-book abstractive QA; 1,000 instances from the validation no-context reading comprehension split of TriviaQA [63], a closed-book extractive QA; 800 instances from the validation split of TruthfulQA [64], a closed-book generative QA; and the light validation split of AmbigQA [65], an open-book open-domain QA.
Dataset Splits	Yes	We perform experiments on four question-answering (QA) benchmark datasets covering both closed- and open-book tasks: 907 validation matched instances with shorter stories from CoQA [62], a closed-book abstractive QA; 1,000 instances from the validation no-context reading comprehension split of TriviaQA [63], a closed-book extractive QA; 800 instances from the validation split of TruthfulQA [64], a closed-book generative QA; and the light validation split of AmbigQA [65], an open-book open-domain QA.
Hardware Specification	Yes	All experiments were conducted on an NVIDIA A100 GPU, with each sampling scheme requiring under 32 GB of VRAM.
Software Dependencies	No	The paper mentions specific model checkpoints like "microsoft/deberta-large-mnli checkpoint" and optimizers like "AdamW", but does not provide explicit version numbers for core software dependencies such as Python, PyTorch/TensorFlow, or CUDA, which are necessary for full reproducibility.
Experiment Setup	Yes	Optimization uses AdamW [73] (initial learning rate 5 × 10⁻⁵, weight decay 0.01) with a batch size of 8 for two epochs. ... Empirically, E_target = 0.3 and λ_0 = 0 work well.