Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Know What You Don't Know: Uncertainty Calibration of Process Reward Models
Authors: Young-Jin Park, Kristjan Greenewald, Kaveh Alimohammadi, Hao Wang, Navid Azizan
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective IAS, and (iii) the proposed IAS strategy reduces inference costs while maintaining final answer accuracy, utilizing less compute on more confident problems as desired. |
| Researcher Affiliation | Collaboration | Young-Jin Park¹, Kristjan Greenewald², Kaveh Alim¹, Hao Wang²,³, Navid Azizan¹; ¹Massachusetts Institute of Technology, ²MIT-IBM Watson AI Lab, ³Red Hat AI Innovation |
| Pseudocode | Yes | Algorithm 1 One-Sided Split Conformal Quantile Regression (specialized from [38]) |
| Open Source Code | Yes | To ensure full reproducibility and facilitate future research, we publicly release our complete codebase, including all prompt templates, sampling configurations, and evaluation scripts. Our calibration datasets comprising $(q, x^{(q,i)}_{0:t}, p^{(q,i,t)})$ triplets of question, prefix trajectory, and empirical success probability are available for multiple reasoning LLMs at https://huggingface.co/datasets/young-j-park/prm_calibration. |
| Open Datasets | Yes | We evaluate our method on two mathematical-reasoning benchmarks, MATH500 [17] and AIME24-25 (i.e., AIME2024 and AIME2025), using six LLMs: Llama-3.2-1B & 3.1-8B-Instruct [47], Qwen2.5-Math-1.5B & 7B-Instruct [55], DeepSeek-R1-Distill-Llama & Qwen-8B [9]. Our calibration datasets comprising $(q, x^{(q,i)}_{0:t}, p^{(q,i,t)})$ triplets of question, prefix trajectory, and empirical success probability are available for multiple reasoning LLMs at https://huggingface.co/datasets/young-j-park/prm_calibration. |
| Dataset Splits | Yes | Our validation set is defined as $\bigcup_{q \in Q_{\mathrm{val}}} \{x^{(q,1)}, x^{(q,2)}, \ldots, x^{(q,N_{\mathrm{val}})}\}$, where $Q_{\mathrm{val}}$ represents a curated set of validation questions; in our experiments, we sample 500 random questions from the MATH training split. |
| Hardware Specification | Yes | All simulations are conducted using NVIDIA V100 32GB SXM3 devices with the vLLM inference acceleration framework [26]. Fine-tuning is performed using NVIDIA A100 40GB SXM4 GPUs and the TRL library [49]. |
| Software Dependencies | No | The paper mentions "vLLM inference acceleration framework [26]" and "TRL library [49]" but does not specify their version numbers, which is required for reproducibility. |
| Experiment Setup | Yes | We employ the standard sampling configuration: top_p = 1.0, top_k = -1 (considering all tokens in the vocabulary), and temperature T = 0.7 for all LLMs, consistent with best practices in recent reasoning model studies [2, 37]. Specifically, we apply LoRA [19] with a rank of 2, a dropout rate of 0.1, and a scaling factor of 32. We adopt a conservative setting with C = 0.99 and β = 0.1 (see Appendix E.5 for an ablation study). |
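
The Pseudocode row above refers to a one-sided split conformal quantile regression procedure (Algorithm 1 in the paper). As a rough illustration of the general idea only, the following is a minimal sketch of a textbook one-sided split conformal correction in plain Python. It is not the authors' Algorithm 1. The function names `conformal_offset` and `calibrated_lower_bound`, the miscoverage level `alpha`, and the toy numbers are all assumptions introduced here for illustration; how `alpha` relates to the paper's β or C is not stated in this summary.

```python
import math


def conformal_offset(pred_lower, y_true, alpha):
    """One-sided split conformal correction (generic sketch, hypothetical helper).

    pred_lower: lower-quantile predictions on a held-out calibration set.
    y_true:     observed targets (e.g. empirical success probabilities).
    alpha:      target miscoverage level (assumed parameter name).
    Returns an offset Q such that pred - Q is a lower bound holding with
    probability >= 1 - alpha under exchangeability.
    """
    if len(pred_lower) != len(y_true) or not pred_lower:
        raise ValueError("calibration inputs must be non-empty and aligned")
    n = len(pred_lower)
    # Conformity score: positive when the predicted lower bound overshoots the target.
    scores = sorted(p - y for p, y in zip(pred_lower, y_true))
    # Finite-sample corrected rank, ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return scores[k - 1]


def calibrated_lower_bound(pred, offset):
    """Shift a new prediction down by the offset, clipped at 0 for probabilities."""
    return max(0.0, pred - offset)


# Toy calibration set (made-up numbers, not from the paper).
offset = conformal_offset([0.5, 0.5, 0.5], [0.25, 0.5, 0.75], alpha=0.25)
bound = calibrated_lower_bound(0.5, offset)
```

With only three calibration points, a stricter level such as `alpha=0.1` cannot be met, so the sketch returns an infinite offset and the clipped bound falls back to 0. A practical calibration set would be much larger.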