Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Majority of the Bests: Improving Best-of-N via Bootstrapping

Authors: Amin Rakhsha, Kanika Madan, Tianyu Zhang, Amir-massoud Farahmand, Amir Khasahmadi

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results across five benchmarks, three different base LLMs, and two reward models demonstrate consistent improvements over BoN in 25 out of 30 setups. We conducted a series of experiments to compare the performance of our proposed method against other well-known sample-and-marginalize approaches across a range of datasets, generative models, and reward models.
Researcher Affiliation | Collaboration | (1) University of Toronto; (2) Vector Institute; (3) Autodesk; (4) Polytechnique Montréal; (5) Mila Quebec AI Institute
Pseudocode | No | We provide a procedure to adaptively select m, eliminating any critical hyperparameters from the algorithm. In the supplementary material, we provide an even more efficient way of estimating $\hat{\pi}_{m,N}$ with $O(N \log N)$ complexity that finds $Z^{\mathrm{MoB}}_{m,N}$ directly and without creating B datasets. The paper describes a procedure and mentions an efficient way of estimating, but does not present a clearly labeled pseudocode or algorithm block with structured steps.
Open Source Code | Yes | Code and data available at https://github.com/arakhsha/mob
Open Datasets | Yes | The datasets include MATH500 (Lightman et al., 2023), GSM8K (Cobbe et al., 2021b), MMLU-Pro (Wang et al., 2024b) questions in math (MMLU-Pro-Math) and chemistry (MMLU-Pro-Chem), and Common Sense QA (Talmor et al., 2018).
Dataset Splits | Yes | MATH500, first introduced by Lightman et al. (2023), is a randomly sampled subset of 500 math questions with short final answers from the MATH dataset (Hendrycks et al., 2021). For all benchmarks, we randomly select 500 questions for our experiments.
Hardware Specification | Yes | The generation was carried on H100 GPUs.
Software Dependencies | No | We use Huggingface's Python library for all the output generations. We use the Scipy library (Virtanen et al., 2020) in python to conduct one-sided paired t-test. The paper mentions software libraries and programming language but does not provide specific version numbers for them.
Experiment Setup | Yes | We always use temperature 1 for inference and no extra modification of the next-token sampling procedure. The final answer extraction and evaluation are calculated using the Language Model Evaluation Harness (Gao et al., 2024). For each question, we generate 512 outputs and for each budget size N, we run each algorithm 512/N times. For GSM8K, we use a 5-shot prompt. For MATH and MMLU-Pro questions, we use the zero-shot chain-of-thought prompting used in the official Llama3.1 models evaluation (Grattafiori et al., 2024) on MATH (Hendrycks et al., 2021).
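
The Pseudocode entry notes that the paper gives no algorithm block, so the following is only a rough sketch of what "creating B datasets" could look like, inferred from the title and the quoted text. It assumes each bootstrap dataset is m samples drawn with replacement from the N generated outputs, that Best-of-m is applied to each, and that the most frequent winning answer is returned. The function names, the with-replacement sampling, and the default `num_bootstrap` are all assumptions, not the authors' implementation; the official code is in the linked repository.

```python
import random
from collections import Counter


def best_of_n(answers, rewards):
    """Return the final answer of the sample with the highest reward."""
    best_index = max(range(len(answers)), key=lambda i: rewards[i])
    return answers[best_index]


def majority_of_bests(answers, rewards, m, num_bootstrap=1000, seed=0):
    """Hypothetical bootstrap sketch (assumed, not taken from the paper).

    Draw `num_bootstrap` subsets of size `m` with replacement, take the
    highest-reward answer in each subset, and return the most common winner.
    """
    rng = random.Random(seed)
    n = len(answers)
    votes = Counter()
    for _ in range(num_bootstrap):
        subset = [rng.randrange(n) for _ in range(m)]
        winner = max(subset, key=lambda i: rewards[i])
        votes[answers[winner]] += 1
    return votes.most_common(1)[0][0]


# Toy data: the reward model overrates one wrong answer ("17").
answers = ["42", "42", "42", "17", "42"]
rewards = [0.70, 0.65, 0.72, 0.90, 0.68]

bon_answer = best_of_n(answers, rewards)
mob_answer = majority_of_bests(answers, rewards, m=2)
```

In this toy case plain Best-of-N follows the single overrated sample, while the bootstrapped vote with a small m favors the answer that wins most subsets. The O(N log N) estimator mentioned in the supplementary material is not reproduced here.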
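
The Experiment Setup entry says each algorithm is run 512/N times per question for a budget of N. One way to read this, assumed here rather than confirmed by the source, is that the 512 generated outputs are split into disjoint consecutive chunks of size N and the selection rule is scored once per chunk. The helper names `split_into_runs` and `accuracy_over_runs` are hypothetical.

```python
def split_into_runs(outputs, budget):
    """Split outputs into len(outputs) // budget disjoint chunks of size budget."""
    num_runs = len(outputs) // budget
    return [outputs[r * budget:(r + 1) * budget] for r in range(num_runs)]


def accuracy_over_runs(outputs, budget, select, is_correct):
    """Average correctness of `select` over all disjoint runs (assumed protocol)."""
    runs = split_into_runs(outputs, budget)
    return sum(1 for run in runs if is_correct(select(run))) / len(runs)


# Placeholder outputs: integers stand in for 512 generated samples.
outputs = list(range(512))
runs = split_into_runs(outputs, 64)

# Toy selection rule and correctness check, for illustration only.
toy_accuracy = accuracy_over_runs(
    outputs, 64, select=max, is_correct=lambda value: value >= 256
)
```

With a budget of 64 this yields 8 runs per question, and larger budgets give fewer runs, which is worth keeping in mind when comparing variance across budget sizes.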