Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Reasoning Models Better Express Their Confidence

Authors: Dongkeun Yoon, Seungone Kim, Sohee Yang, Sunkyoung Kim, Soyeon Kim, Yongil Kim, Eunbi Choi, Yireun Kim, Minjoon Seo

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | we benchmark six reasoning models across six datasets and find that they achieve strictly better confidence calibration than their non-reasoning counterparts in 33 out of the 36 settings. Our detailed analysis reveals that these gains in calibration stem from the slow thinking behaviors of reasoning models...
Researcher Affiliation | Collaboration | Dongkeun Yoon (1), Seungone Kim (3), Sohee Yang (4), Sunkyoung Kim (2), Soyeon Kim (2), Yongil Kim (2), Eunbi Choi (2), Yireun Kim (2), Minjoon Seo (1). Affiliations: (1) KAIST, (2) LG AI Research, (3) CMU, (4) UCL
Pseudocode | No | The paper includes 'Listing 1', 'Listing 2', 'Listing 3', and 'Listing 4' which provide prompts and examples, but not structured pseudocode or algorithm blocks for the main methodology.
Open Source Code | Yes | Our code is available at https://github.com/MattYoon/reasoning-models-confidence
Open Datasets | Yes | We use the knowledge-focused datasets, TriviaQA and NonambigQA [15, 23, 17]... MMLU-Pro and SuperGPQA [42, 21]... SuperGPQA [21]: ODC-BY, https://huggingface.co/datasets/m-a-p/SuperGPQA; MMLU-Pro [42]: MIT, https://huggingface.co/datasets/TIGER-Lab/MMLU-Pro; NonambigQA [17, 23]: CC-BY-SA-3.0, https://github.com/shmsw25/AmbigQA; TriviaQA [15]: Apache-2.0, https://huggingface.co/datasets/mandarjoshi/trivia_qa
Dataset Splits | Yes | We use two subsets of each reasoning dataset: a Math subset focused on arithmetic reasoning, and a Non-Math subset covering other types of reasoning. Due to our broad range of experiments, we uniformly sample 1,000 examples from each dataset or subset to keep the compute manageable. To assess variability, we perform bootstrapping by generating five resampled subsets of 1,000 examples for each dataset and report the standard deviation across runs (Table 10).
Hardware Specification | Yes | We conduct our experiments on machines equipped with either Nvidia A6000 48GBs or A100 80GBs GPUs. For evaluating 32B-scale models, we use two GPUs.
Software Dependencies | No | The paper mentions 'Leveraging vLLM [18] for efficient inference' and 'we use GPT-4.1 [29]', but does not provide specific version numbers for these or any other software components used in the experimental setup.
Experiment Setup | Yes | In a single turn of conversation, we instruct the models to perform three steps using CoT: (1) SOLUTION REASONING... (2) CONFIDENCE REASONING... (3) CONFIDENCE VERBALIZATION, where it maps their confidence in one of ten bins, ranging from Almost no chance (0–0.1) to Almost certain (0.9–1.0)... we use greedy decoding, with the maximum token length set to 4096 for knowledge-focused datasets and 8192 for reasoning-intensive datasets. The full prompt we use is included in Appendix B.2.
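
The Dataset Splits entry describes uniformly sampling 1,000 examples and then bootstrapping five resampled subsets to report a standard deviation. The following is a minimal Python sketch of that procedure, not the authors' code. It assumes the resampling is done with replacement and that the sample standard deviation is reported; the quoted text states neither. The function names `sample_subset` and `bootstrap_std` and the `metric` callback are hypothetical.

```python
import random
import statistics


def sample_subset(examples, k=1000, seed=0):
    """Uniformly sample k examples without replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(examples, k)


def bootstrap_std(examples, metric, n_resamples=5, k=1000, seed=0):
    """Standard deviation of `metric` across bootstrap resamples.

    Assumptions (not confirmed by the source): each resample draws k
    items WITH replacement, and the sample standard deviation is used.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_resamples):
        resample = rng.choices(examples, k=k)  # draws with replacement
        scores.append(metric(resample))
    return statistics.stdev(scores)
```

As a usage example, `bootstrap_std(sample_subset(data), lambda xs: sum(xs) / len(xs))` would estimate the spread of accuracy when `data` is a list of 0/1 correctness flags. With only five resamples the estimate is coarse, which is consistent with the entry's framing of it as a variability check rather than a confidence interval.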