Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ConfTuner: Training Large Language Models to Express Their Confidence Verbally

Authors: Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4 Experiments. In this section, we first provide the experimental setup, then investigate whether ConfTuner learns effective verbalized confidence estimation and how this capability enables more trustworthy LLM systems. Finally, we compare the training/inference time and training data size, demonstrating the efficiency of ConfTuner.
Researcher Affiliation	Academia	Yibo Li (National University of Singapore), Miao Xiong (National University of Singapore), Jiaying Wu (National University of Singapore), Bryan Hooi (National University of Singapore)
Pseudocode	No	The paper describes the ConfTuner algorithm in Section 3 and its two key steps, but does not present it in a structured pseudocode or algorithm block.
Open Source Code	Yes	The code is available at https://github.com/liushiliushi/ConfTuner.
Open Datasets	Yes	Datasets. Following [33], we use HotpotQA [35] for training... For evaluation, besides the evaluation set of HotpotQA, we also adopt: 1) TriviaQA [15]... 2) StrategyQA [9]... 3) GSM8K [6]... 4) TruthfulQA [23].
Dataset Splits	Yes	For evaluation, besides the evaluation set of HotpotQA, we also adopt: 1) TriviaQA [15]... following [29], we sample 1,000 for evaluation. 3) GSM8K [6], a benchmark... Here we sample 1,000 for evaluation.
Hardware Specification	Yes	The experiments are run with 6 Nvidia A40 GPUs. For fair comparison, training was conducted on 4 A40 GPUs and inference on a single A40 GPU.
Software Dependencies	No	The models are implemented with the Huggingface Transformers (https://huggingface.co/) library. For evaluation, we use the vllm (https://github.com/vllm-project/vllm) library. Specific version numbers for these libraries are not provided.
Experiment Setup	Yes	For LLaMA, the optimal configuration was determined to be a learning rate of 1e-5, 2 training epochs, and a batch size of 16. Ministral achieved peak performance with a slightly higher learning rate of 3e-5, 2 epochs, and the same batch size of 16. Meanwhile, the Qwen model required an extended training regimen of 3 epochs and a larger batch size of 24, paired with a learning rate of 1e-5.
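
For readers who want to reuse the hyperparameters quoted in the Experiment Setup row, the values can be collected in a small lookup table. The following is only a minimal sketch: the names `TrainConfig`, `REPORTED_CONFIGS`, and `get_config`, as well as the dictionary keys, are hypothetical and are not taken from the ConfTuner repository. Only the learning rates, epoch counts, and batch sizes come from the quoted text, and the text does not say whether the batch size is per device or global.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (hypothetical container)."""
    learning_rate: float
    epochs: int
    batch_size: int  # the quoted text does not say whether this is per device or global


# Keys are illustrative labels for the model families named in the row,
# not exact model checkpoint identifiers.
REPORTED_CONFIGS = {
    "llama": TrainConfig(learning_rate=1e-5, epochs=2, batch_size=16),
    "ministral": TrainConfig(learning_rate=3e-5, epochs=2, batch_size=16),
    "qwen": TrainConfig(learning_rate=1e-5, epochs=3, batch_size=24),
}


def get_config(model_family: str) -> TrainConfig:
    """Return the quoted configuration for a model family (case-insensitive)."""
    key = model_family.strip().lower()
    if key not in REPORTED_CONFIGS:
        raise KeyError(f"No reported configuration for model family: {model_family!r}")
    return REPORTED_CONFIGS[key]


if __name__ == "__main__":
    for name, cfg in REPORTED_CONFIGS.items():
        print(name, cfg)
```

Because library versions are not reported (see the Software Dependencies row), these values alone may not be enough to reproduce the paper's numbers exactly.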