Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Optimizing Anytime Reasoning via Budget Relative Policy Optimization

Authors: Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
Researcher Affiliation	Collaboration	Sea AI Lab (1); National University of Singapore (2)
Pseudocode	No	The paper describes methods using text and mathematical equations in Section 2 'Methodology' but does not include any explicit pseudocode or algorithm blocks.
Open Source Code	Yes	https://github.com/sail-sg/AnytimeReasoner
Open Datasets	Yes	We fine-tuned DeepSeek-R1-Distill-Qwen-1.5B [Guo et al., 2025] on 40,315 math problems from DeepScaleR [Luo et al., 2025] for a single epoch... After training, we assess the final model using five benchmarks: AIME2024 [Li et al., 2024a], AMC2022 [Li et al., 2024a], MATH500 [Hendrycks et al., 2021], Minerva Math [Lewkowycz et al., 2022], and OlympiadBench [He et al., 2024]...
Dataset Splits	Yes	We fine-tuned DeepSeek-R1-Distill-Qwen-1.5B [Guo et al., 2025] on 40,315 math problems from DeepScaleR [Luo et al., 2025] for a single epoch, using a batch size of 64 questions per policy iteration. ... After training, we assess the final model using five benchmarks: AIME2024 [Li et al., 2024a], AMC2022 [Li et al., 2024a], MATH500 [Hendrycks et al., 2021], Minerva Math [Lewkowycz et al., 2022], and OlympiadBench [He et al., 2024]...
Hardware Specification	Yes	Our experiments were conducted on 8 NVIDIA A100 80G GPUs, with each experiment taking approximately 30 hours to complete (less than 10% overhead in total compared to GRPO).
Software Dependencies	No	We implement our algorithms based on the Verl framework [Sheng et al., 2024], incorporating several key modifications as detailed in Appendix B. We employ Proximal Policy Optimization (PPO) [Schulman et al., 2017] to optimize both thinking and summary policies.
Experiment Setup	Yes	During training, we allocate four token budgets (m = 4) for thinking: {2000, 4000, 6000, 8000}. For each question, we sample a group of 8 complete thinking processes (stopped either by </think> or when exceeding 8000 tokens). We sample 4 answers to calculate the average score at each thinking budget... The summary length is restricted to 128 tokens. We fine-tuned DeepSeek-R1-Distill-Qwen-1.5B [Guo et al., 2025] on 40,315 math problems from DeepScaleR [Luo et al., 2025] for a single epoch, using a batch size of 64 questions per policy iteration.
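
The figures quoted in the Experiment Setup row imply some simple sampling arithmetic, sketched below in Python. This is an illustrative calculation, not code from the paper or its repository. The class name `SetupSketch` and its method names are hypothetical. Two assumptions are made that the quoted text does not confirm: that the final partial batch counts as a policy iteration, and that the 4 answers are sampled for every thinking process at every budget, which makes the per-question summary count an upper bound.

```python
from dataclasses import dataclass
from math import ceil


@dataclass(frozen=True)
class SetupSketch:
    """Hypothetical container for the numbers quoted in the Experiment Setup row."""

    num_problems: int = 40_315                       # training problems, single epoch
    questions_per_iteration: int = 64                # batch size per policy iteration
    thinking_budgets: tuple = (2000, 4000, 6000, 8000)  # m = 4 token budgets
    thinking_samples_per_question: int = 8           # group size of thinking processes
    answers_per_budget: int = 4                      # answers sampled at each budget
    max_summary_tokens: int = 128                    # summary length limit

    def policy_iterations(self) -> int:
        # Assumption: a final partial batch still counts as one iteration.
        return ceil(self.num_problems / self.questions_per_iteration)

    def thinking_rollouts_per_iteration(self) -> int:
        # One group of thinking processes per question in the batch.
        return self.questions_per_iteration * self.thinking_samples_per_question

    def max_summaries_per_question(self) -> int:
        # Assumption: answers are sampled per thinking process and per budget,
        # so this is an upper bound rather than a confirmed count.
        return (
            self.thinking_samples_per_question
            * len(self.thinking_budgets)
            * self.answers_per_budget
        )


if __name__ == "__main__":
    setup = SetupSketch()
    print("policy iterations:", setup.policy_iterations())
    print("thinking rollouts per iteration:", setup.thinking_rollouts_per_iteration())
    print("max summaries per question:", setup.max_summaries_per_question())
```

Under these assumptions a single epoch covers roughly 630 policy iterations, each drawing 512 thinking processes.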