Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Scalable Best-of-N Selection for Large Language Models via Self-Certainty

Authors: Zhewei Kang, Xuandong Zhao, Dawn Song

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities. We rigorously evaluate our methods across diverse reasoning benchmarks, including LiveBench-Math [White et al., 2024], GSM8K [Cobbe et al., 2021b], MATH [Hendrycks et al., 2021], CRUXEval [Gu et al., 2024] and LiveCodeBench [Jain et al., 2024], spanning mathematical reasoning, code reasoning, and code generation. Our experiments reveal that self-certainty-based voting consistently outperforms self-consistency in Best-of-N selection of reasoning tasks, effectively adapting to varying sample sizes and question difficulties.
Researcher Affiliation | Academia | Zhewei Kang UC Berkeley EMAIL Xuandong Zhao UC Berkeley EMAIL Dawn Song UC Berkeley EMAIL
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes methodologies using narrative text, mathematical formulas, and figures illustrating examples of reasoning paths or voting mechanisms rather than formal pseudocode.
Open Source Code | Yes | The code is available at https://github.com/backprop07/Self-Certainty
Open Datasets | Yes | We rigorously evaluate our methods across diverse reasoning benchmarks, including LiveBench-Math [White et al., 2024], GSM8K [Cobbe et al., 2021b], MATH [Hendrycks et al., 2021], CRUXEval [Gu et al., 2024] and LiveCodeBench [Jain et al., 2024], spanning mathematical reasoning, code reasoning, and code generation.
Dataset Splits | Yes | We use LiveBench-Math dataset [White et al., 2024], the validation set of GSM8K dataset [Cobbe et al., 2021b] and the test set of MATH dataset [Hendrycks et al., 2021].
Hardware Specification | Yes | All experiments are run on NVIDIA A100 GPUs.
Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers, such as Python, PyTorch, or TensorFlow versions.
Experiment Setup | Yes | We sample 64 responses (temperature=0.6, top-p=0.9) and create subsets of N = 4, 8, 16, 32, 64 for Best-of-N selection.
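
The Experiment Setup entry describes drawing a pool of 64 sampled responses and forming subsets of N = 4, 8, 16, 32, 64 for Best-of-N selection. The following is a minimal sketch of that evaluation loop, not the authors' implementation. It assumes each response has already been reduced to an (answer, score) pair, that subsets are drawn by random sampling without replacement, and that selection simply takes the highest-scoring response. The function names, the seed, and the scoring values are hypothetical; the paper's actual self-certainty score and voting scheme are not reproduced here.

```python
import random
from collections import Counter


def best_of_n(candidates):
    """Return the answer of the highest-scoring candidate.

    `candidates` is a list of (answer, score) pairs. The score is a
    stand-in for any per-response confidence measure (assumption).
    """
    return max(candidates, key=lambda c: c[1])[0]


def majority_vote(candidates):
    """Self-consistency style baseline: return the most frequent answer."""
    counts = Counter(answer for answer, _ in candidates)
    return counts.most_common(1)[0][0]


def make_subsets(pool, sizes=(4, 8, 16, 32, 64), seed=0):
    """Draw one subset of each size N from the pool of sampled responses.

    Random sampling without replacement is an assumption; the source only
    says that subsets of these sizes are created from 64 responses
    (which it reports sampling at temperature=0.6, top-p=0.9).
    """
    rng = random.Random(seed)
    return {n: rng.sample(pool, n) for n in sizes}


if __name__ == "__main__":
    # Placeholder pool of 64 (answer, score) pairs, purely illustrative.
    pool = [(str(i % 5), float(i)) for i in range(64)]
    for n, subset in make_subsets(pool).items():
        print(n, best_of_n(subset), majority_vote(subset))
```

Keeping the pool fixed and varying only N makes it possible to compare selection rules on the same samples, which is one plausible reading of why the subsets are derived from a single set of 64 responses.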