Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling

Authors: Xinglin Wang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Yueqi Zhang, Jiayi Shi, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To demonstrate DORA's effectiveness, we conduct extensive experiments on challenging mathematical reasoning benchmarks including MATH500, AIME2024, and AIME2025. The empirical results show that DORA consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art accuracy.
Researcher Affiliation	Collaboration	Xinglin Wang1, Yiwei Li1, Shaoxiong Feng2, Peiwen Yuan1, Yueqi Zhang1, Jiayi Shi1, Chuyi Tan1, Boyuan Pan2, Yao Hu2, Kan Li1; 1 School of Computer Science, Beijing Institute of Technology; 2 Xiaohongshu Inc
Pseudocode	Yes	The parallel search process can be summarized as Algorithm 1. Specifically, the process iteratively expands a set of partial solutions using the policy π, collects complete solutions, and redistributes the rollout budget via the allocation strategy O based on intermediate rewards from Q. Once sufficient complete solutions are gathered, the final answer is selected using the voting method V.
Open Source Code	Yes	Our code and data have been released on https://github.com/WangXinglin/DORA.
Open Datasets	Yes	To validate the effectiveness of DORA, we evaluate it on the challenging mathematical benchmarks MATH500 (Hendrycks et al., 2021), AIME2024 (AI-MO, 2024), and AIME2025 across a broad range of rollout budgets and policy models.
Dataset Splits	No	We evaluate models under rollout budgets of 16, 32, 64, 128, and 256 on the main benchmarks. Following Hochlehnert et al. (2025), we repeat all experiments five times on MATH500 and ten times on AIME2024 and AIME2025, reporting the average performance across all runs to reduce the impact of randomness and improve the reliability of our conclusions.
Hardware Specification	Yes	All experiments are executed in parallel on a cluster with 32 NVIDIA A100 GPUs (40G), where each individual run is allocated to a single GPU.
Software Dependencies	No	All experiments use temperature sampling with temperature = 0.8 and top_p = 1.0. We set the token limit to 256 per step and 2048 tokens in total for each solution. For Beam Search and DVTS, we use a beam width of 4 following Snell et al. (2024). For REBASE, we set its T_b to 0.1, consistent with its original implementation. For DORA, we employ the open-source BGE-M3 embedding model (Chen et al., 2024a) to compute semantic similarity between trajectories, chosen for its lightweight architecture, strong empirical performance, and ability to handle long input sequences. We set the T_b for quality scores to 0.1 (matching REBASE), and the semantic similarity temperature T_s to 0.01.
Experiment Setup	Yes	All experiments use temperature sampling with temperature = 0.8 and top_p = 1.0. We set the token limit to 256 per step and 2048 tokens in total for each solution. For Beam Search and DVTS, we use a beam width of 4 following Snell et al. (2024). For REBASE, we set its T_b to 0.1, consistent with its original implementation. For DORA, we employ the open-source BGE-M3 embedding model (Chen et al., 2024a) to compute semantic similarity between trajectories, chosen for its lightweight architecture, strong empirical performance, and ability to handle long input sequences. We set the T_b for quality scores to 0.1 (matching REBASE), and the semantic similarity temperature T_s to 0.01.
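
Illustrative sketch (not from the paper): the Pseudocode row describes a loop that expands partial solutions with a policy, collects complete solutions, redistributes the rollout budget from intermediate rewards, and then votes. The Python below is one minimal way such a loop could look, under stated assumptions. It is not the paper's Algorithm 1 and it does not implement DORA's allocation strategy. Every name here (`softmax_allocate`, `parallel_search`, `policy`, `reward`, `is_complete`, `extract_answer`) is hypothetical. The allocation rule is a plain temperature-scaled softmax over rewards, used only as a stand-in, with the default temperature set to the 0.1 reported for quality scores. Plain majority voting is likewise an assumption.

```python
import math
from collections import Counter
from typing import Callable, List, Optional, Sequence


def softmax_allocate(rewards: Sequence[float], budget: int,
                     temperature: float = 0.1) -> List[int]:
    """Split `budget` rollouts across candidates by softmax(reward / temperature).

    Stand-in allocation rule for illustration; not DORA's strategy.
    """
    if not rewards:
        return []
    peak = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp((r - peak) / temperature) for r in rewards]
    total = sum(weights)
    shares = [budget * w / total for w in weights]
    counts = [math.floor(s) for s in shares]
    # Give the rollouts lost to rounding to the largest fractional remainders.
    leftover = budget - sum(counts)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def parallel_search(question: str,
                    policy: Callable[[str], str],
                    reward: Callable[[str], float],
                    is_complete: Callable[[str], bool],
                    extract_answer: Callable[[str], str],
                    allocate: Callable[[Sequence[float], int], List[int]],
                    budget: int,
                    max_steps: int = 8) -> Optional[str]:
    """Expand, collect, reallocate, then vote (hypothetical skeleton)."""
    frontier = [question]   # partial solutions still being extended
    counts = [budget]       # rollouts assigned to each partial solution
    complete: List[str] = []
    for _ in range(max_steps):
        # Expand each partial solution with its assigned number of rollouts.
        candidates = [policy(p) for p, n in zip(frontier, counts)
                      for _ in range(n)]
        complete.extend(c for c in candidates if is_complete(c))
        frontier = [c for c in candidates if not is_complete(c)]
        if len(complete) >= budget or not frontier:
            break
        # Redistribute the remaining budget using intermediate rewards.
        remaining = budget - len(complete)
        counts = allocate([reward(p) for p in frontier], remaining)
    # Assumption: simple majority vote over the final answers.
    answers = [extract_answer(c) for c in complete]
    return Counter(answers).most_common(1)[0][0] if answers else None
```

In practice `policy` would wrap a sampled generation step from the language model and `reward` a process reward model. Both are passed in as plain callables so the skeleton can be exercised with toy functions.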