Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Sampling-Efficient Test-Time Scaling: Self-Estimating the Best-of-N Sampling in Early Decoding

Authors: Yiming Wang, Pei Zhang, Siyuan Huang, Baosong Yang, Zhuosheng Zhang, Fei Huang, Rui Wang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments to demonstrate that ST-BoN strikes a strong balance between cost and performance. When reaching the performance of Full-BoN at the certain N values, ST-BoN can save computational cost by 70% to 80%. Also, when consuming similar computational costs, ST-BoN improves accuracy by 3 to 4 points. In addition, ST-BoN is applicable across a wide range of domains. These results show that ST-BoN can provide a flexible solution to balance cost and performance when faced with limited resources. Section 5: Experiments. Datasets. Models. Baselines. Implementation. Evaluation. Results I: Objective Task. Results II: Subjective Tasks.
Researcher Affiliation	Collaboration	Yiming Wang (α), Pei Zhang (β), Siyuan Huang (α), Baosong Yang (β), Zhuosheng Zhang (α), Fei Huang (β), Rui Wang (α). α: School of Computer Science, Shanghai Jiao Tong University; β: Tongyi Lab, Alibaba Group.
Pseudocode	No	The paper describes the method using a pipeline diagram (Figure 1) and textual steps, but no explicitly labeled 'Pseudocode' or 'Algorithm' block is present.
Open Source Code	Yes	Code and data are available in https://github.com/Alsace08/ST-BoN.
Open Datasets	Yes	Datasets. We select four datasets for objective tasks: MATH [18], TheoremQA [8], GPQA [37], and MMLU [17]. They span a range of domains, including mathematics, theorem application, science reasoning, and general knowledge, and present a significant difficulty level. We also select two datasets for subjective tasks: CNNDM [34] and AlpacaFarm [11].
Dataset Splits	Yes	Datasets. We select four datasets for objective tasks: MATH [18], TheoremQA [8], GPQA [37], and MMLU [17]. ... We also select two datasets for subjective tasks: CNNDM [34] and AlpacaFarm [11]. ... We mainly adopt 7B+ parameter models with the Zero-Shot-CoT generation paradigm [52, 23]. ... For each D_i, we compute the proportion of cases where the LLM's self-estimated best sampling produces a correct final answer, representing early-final consistency. Naturally, as i decreases, the estimation becomes more challenging, and the random consistency expectation is i/N.
Hardware Specification	Yes	All experiments are run on 80G A100 GPUs, with the GPU number varying based on N.
Software Dependencies	No	All baselines are implemented using the Hugging Face Transformers [43] library's model.generate() function with KV cache. The paper does not specify version numbers for Hugging Face Transformers or any other software dependencies.
Experiment Setup	Yes	Implementation. We use the sampling strategy combining top-k [13], top-p [20], and temperature T [19], with k = 20, p = 0.95, and T = 0.7. The buffer window length τ is set to c in our main experiments. Detailed hyperparameter analysis is provided in Section 6.2 and 6.3.
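
The sampling configuration quoted in the Experiment Setup row (top-k with k = 20, top-p with p = 0.95, temperature T = 0.7) can be illustrated with a minimal pure-Python sketch. This is not the paper's code: the function names, the toy logits, and the order of operations (temperature, then top-k, then top-p) are assumptions made for illustration only.

```python
import math
import random


def filter_distribution(logits, k=20, p=0.95, temperature=0.7):
    """Return {token_id: probability} after temperature, top-k, then top-p filtering.

    Hypothetical helper for illustration; the defaults mirror the values
    quoted in the Experiment Setup row.
    """
    # Temperature: divide logits by T before the softmax (T < 1 sharpens).
    scaled = [x / temperature for x in logits]
    # Top-k: keep the ids of the k highest-scoring tokens, best first.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, mass = [], 0.0
    for token_id, prob in zip(ranked, probs):
        kept.append((token_id, prob))
        mass += prob
        if mass >= p:
            break
    # Renormalize so the kept probabilities sum to 1.
    return {token_id: prob / mass for token_id, prob in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


rng = random.Random(0)                      # fixed seed for repeatability
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]    # made-up 5-token vocabulary
dist = filter_distribution(toy_logits)
token = sample_token(toy_logits, rng)
```

With Hugging Face Transformers, the same settings would presumably be passed to `model.generate()` as `do_sample=True, top_k=20, top_p=0.95, temperature=0.7`; whether the authors used exactly this call is not stated in the quoted text.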