Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scalable Best-of-N Selection for Large Language Models via Self-Certainty
Authors: Zhewei Kang, Xuandong Zhao, Dawn Song
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. We rigorously evaluate our methods across diverse reasoning benchmarks, including LiveBench-Math [White et al., 2024], GSM8K [Cobbe et al., 2021b], MATH [Hendrycks et al., 2021], CRUXEval [Gu et al., 2024] and LiveCodeBench [Jain et al., 2024], spanning mathematical reasoning, code reasoning, and code generation. Our experiments reveal that self-certainty-based voting consistently outperforms self-consistency in Best-of-N selection of reasoning tasks, effectively adapting to varying sample sizes and question difficulties. |
| Researcher Affiliation | Academia | Zhewei Kang (UC Berkeley); Xuandong Zhao (UC Berkeley); Dawn Song (UC Berkeley) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes methodologies using narrative text, mathematical formulas, and figures illustrating examples of reasoning paths or voting mechanisms rather than formal pseudocode. |
| Open Source Code | Yes | The code is available at https://github.com/backprop07/Self-Certainty |
| Open Datasets | Yes | We rigorously evaluate our methods across diverse reasoning benchmarks, including LiveBench-Math [White et al., 2024], GSM8K [Cobbe et al., 2021b], MATH [Hendrycks et al., 2021], CRUXEval [Gu et al., 2024] and LiveCodeBench [Jain et al., 2024], spanning mathematical reasoning, code reasoning, and code generation. |
| Dataset Splits | Yes | We use LiveBench-Math dataset [White et al., 2024], the validation set of GSM8K dataset [Cobbe et al., 2021b] and the test set of MATH dataset [Hendrycks et al., 2021]. |
| Hardware Specification | Yes | All experiments are run on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers, such as Python, PyTorch, or TensorFlow versions. |
| Experiment Setup | Yes | We sample 64 responses (temperature=0.6, top-p=0.9) and create subsets of N = 4, 8, 16, 32, 64 for Best-of-N selection. |
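
The experiment setup quoted in the table (64 sampled responses, subsets of N = 4, 8, 16, 32, 64) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `best_of_n` and `evaluate_subsets` are hypothetical, the per-response confidence scores are taken as precomputed inputs (the summary above does not define how self-certainty is computed), subsets are assumed to be drawn uniformly at random without replacement, and selection is a plain argmax rather than the voting scheme the paper mentions.

```python
import random

# Subset sizes quoted in the "Experiment Setup" row.
SUBSET_SIZES = (4, 8, 16, 32, 64)


def best_of_n(responses, scores, n, rng):
    """Return the highest-scoring response among n drawn without replacement.

    `scores` holds one precomputed confidence value per response
    (hypothetical input; how it is computed is outside this sketch).
    """
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    if n > len(responses):
        raise ValueError("n exceeds the number of sampled responses")
    # Assumption: subsets are uniform random draws from the response pool.
    subset = rng.sample(range(len(responses)), n)
    best = max(subset, key=lambda i: scores[i])
    return responses[best]


def evaluate_subsets(responses, scores, seed=0):
    """Run Best-of-N selection once for every subset size."""
    rng = random.Random(seed)
    return {n: best_of_n(responses, scores, n, rng) for n in SUBSET_SIZES}
```

With a pool of exactly 64 responses, the N = 64 subset is the whole pool, so that entry always returns the globally highest-scoring response; smaller N values depend on the random draw and the seed.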