Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

Authors: Beier Luo, Shuoyuan Wang, Sharon Li, Hongxin Wei

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments with both open-sourced and API-based LLMs on common benchmarks demonstrate the effectiveness of the DACA method for confidence calibration. Notably, DACA achieves performance comparable to labeled temperature scaling, even in the absence of labeled data. For example, DACA improves the average Expected Calibration Error (ECE) of the Gemma-3-12B-Instruct model [Team et al., 2025] across 57 subjects of the MMLU dataset [Hendrycks et al., 2021], reducing it from 23.68% to 8.60%. In comparison, TS only reduces the ECE to 9.75%.
Researcher Affiliation	Academia	1 Department of Statistics and Data Science, Southern University of Science and Technology; 2 Department of Computer Sciences, University of Wisconsin-Madison. Corresponding author (EMAIL)
Pseudocode	No	The paper describes the DACA method and its loss function (Equations 7 and 8) in prose and mathematical formulations within Section 3 ("Motivation and Method") but does not provide a clearly labeled pseudocode block or algorithm section.
Open Source Code	Yes	Codes are publicly available at https://github.com/ml-stat-Sustech/Disagreement-Aware-Calibration.
Open Datasets	Yes	Datasets. To verify the effectiveness of our proposed methods, we employ three common datasets for evaluations, including: MMLU [Hendrycks et al., 2021], MedMCQA [Pal et al., 2022], and MathQA [Amini et al., 2019]. ... The datasets are provided by Hugging Face.
Dataset Splits	Yes	For the main experiments, we apply confidence calibration to each of the 57 subjects from MMLU and report the average of the calibration metrics. Specifically, we use the validation split of each subject as the validation set. For the MMLU datasets, we conduct five experiments with five different prompts to calculate the mean and standard deviation of the results, as the validation and test splits are predetermined. ... For other datasets, we use the first prompt and report the mean and standard deviation over five random splits of the validation and test sets, with a test-to-validation ratio of 7:3.
Hardware Specification	Yes	Experiment details. We run our experiments on NVIDIA GeForce RTX 4090 and NVIDIA L40 GPU, and implement all methods by PyTorch and vLLM.
Software Dependencies	No	The paper mentions "PyTorch and vLLM" but does not provide specific version numbers for these software components, which is necessary for a reproducible description of ancillary software.
Experiment Setup	Yes	Optimizer details. For both TS and DACA, we use the Adam optimizer with a batch size of 256, a learning rate of 0.05, and train for 400 epochs.
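
For readers unfamiliar with the metrics quoted in the table, the following is a minimal, standard-library Python sketch of Expected Calibration Error (ECE) and of the temperature-scaled softmax that temperature scaling (TS) tunes. It is not taken from the paper or its repository. The function names, the choice of 10 equal-width bins, and the toy inputs are assumptions made purely for illustration; the paper's actual binning and implementation may differ.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (hypothetical helper).

    A temperature above 1 flattens the distribution and lowers the top
    confidence; a temperature below 1 sharpens it.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (bin count is an assumption).

    confidences: predicted probability of the chosen answer, in [0, 1]
    correct:     1 if the chosen answer was right, else 0
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 is clamped into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Toy, made-up data: an overconfident model (90% confident, 50% accurate).
    toy_conf = [0.9, 0.9, 0.9, 0.9]
    toy_correct = [1, 1, 0, 0]
    print(round(expected_calibration_error(toy_conf, toy_correct), 4))

    # Raising the temperature lowers the top-class confidence.
    print(max(softmax_with_temperature([2.0, 0.0], temperature=1.0)))
    print(max(softmax_with_temperature([2.0, 0.0], temperature=2.0)))
```

In the setup reported above, the temperature would be fitted with Adam (batch size 256, learning rate 0.05, 400 epochs); that optimization loop is omitted here to keep the sketch dependency-free.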