Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Optimizing Anytime Reasoning via Budget Relative Policy Optimization
Authors: Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results in mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency. |
| Researcher Affiliation | Collaboration | ¹Sea AI Lab, ²National University of Singapore |
| Pseudocode | No | The paper describes methods using text and mathematical equations in Section 2 'Methodology' but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://github.com/sail-sg/AnytimeReasoner |
| Open Datasets | Yes | We fine-tuned DeepSeek-R1-Distill-Qwen-1.5B [Guo et al., 2025] on 40,315 math problems from DeepScaleR [Luo et al., 2025] for a single epoch... After training, we assess the final model using five benchmarks: AIME2024 [Li et al., 2024a], AMC2022 [Li et al., 2024a], MATH500 [Hendrycks et al., 2021], Minerva Math [Lewkowycz et al., 2022], and OlympiadBench [He et al., 2024]... |
| Dataset Splits | Yes | We fine-tuned DeepSeek-R1-Distill-Qwen-1.5B [Guo et al., 2025] on 40,315 math problems from DeepScaleR [Luo et al., 2025] for a single epoch, using a batch size of 64 questions per policy iteration. ... After training, we assess the final model using five benchmarks: AIME2024 [Li et al., 2024a], AMC2022 [Li et al., 2024a], MATH500 [Hendrycks et al., 2021], Minerva Math [Lewkowycz et al., 2022], and OlympiadBench [He et al., 2024]... |
| Hardware Specification | Yes | Our experiments were conducted on 8 NVIDIA A100 80G GPUs, with each experiment taking approximately 30 hours to complete (less than 10% overhead in total compared to GRPO). |
| Software Dependencies | No | We implement our algorithms based on the Verl framework [Sheng et al., 2024], incorporating several key modifications as detailed in Appendix B. We employ Proximal Policy Optimization (PPO) [Schulman et al., 2017] to optimize both thinking and summary policies. |
| Experiment Setup | Yes | During training, we allocate four token budgets (m = 4) for thinking: {2000, 4000, 6000, 8000}. For each question, we sample a group of 8 complete thinking processes (stopped either by `</think>` or when exceeding 8000 tokens). We sample 4 answers to calculate the average score at each thinking budget... The summary length is restricted to 128 tokens. We fine-tuned DeepSeek-R1-Distill-Qwen-1.5B [Guo et al., 2025] on 40,315 math problems from DeepScaleR [Luo et al., 2025] for a single epoch, using a batch size of 64 questions per policy iteration. |
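
The figures quoted in the Experiment Setup row can be collected into one small configuration object. The following is a minimal sketch, not the authors' code: the class name `AnytimeSetup`, its field names, and both helper methods are hypothetical. It also assumes that the final partial batch counts as a policy iteration and that a thinking process shorter than a budget is used whole at that budget.

```python
from dataclasses import dataclass
from math import ceil
from typing import List, Tuple


@dataclass(frozen=True)
class AnytimeSetup:
    """Hyperparameters quoted in the table; all names here are hypothetical."""

    thinking_budgets: Tuple[int, ...] = (2000, 4000, 6000, 8000)  # m = 4 budgets
    thinking_samples_per_question: int = 8   # group size of thinking processes
    answers_per_budget: int = 4              # answers averaged at each budget
    max_summary_tokens: int = 128            # cap on summary length
    train_problems: int = 40_315             # training set size
    questions_per_iteration: int = 64        # batch size per policy iteration

    def iterations_per_epoch(self) -> int:
        # Assumption: a final, smaller batch still counts as one iteration.
        return ceil(self.train_problems / self.questions_per_iteration)

    def truncation_points(self, thinking_length: int) -> List[int]:
        # Assumption: a thinking process shorter than a budget is used whole,
        # so the prefix length is capped by the process's own length.
        return [min(thinking_length, budget) for budget in self.thinking_budgets]


setup = AnytimeSetup()
print(setup.iterations_per_epoch())
print(setup.truncation_points(5000))
```

Under these assumptions, a thinking process that stops after 5000 tokens is evaluated on its first 2000 and 4000 tokens and then in full for the two larger budgets.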