Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

How Far Are We from Optimal Reasoning Efficiency?

Authors: Jiaxuan Gao, Shu Yan, Qixin Tan, Lu Yang, Shusheng Xu, Wei Fu, Zhiyu Mei, Kaifeng Lyu, Yi Wu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Systematic evaluation on challenging mathematical benchmarks, AMC23, AIME24, and AIME25, reveals significant gaps in current methods: they either sacrifice accuracy for short length or use excessive tokens to achieve sub-optimal accuracies despite high overall accuracy. To reduce the efficiency gap, we propose REO-RL, a Reinforcement Learning algorithm that optimizes reasoning efficiency by targeting a sparse set of token budgets. Experiments show that, compared to vanilla RL with outcome reward, REO-RL reduces the reasoning efficiency gap by 74.5% and 64.2% in the 1.5B and 7B settings. Ablation studies confirm the efficacy of our token budget strategy and highlight REO-RL's flexibility across design choices.
Researcher Affiliation	Collaboration	1 IIIS, Tsinghua University 2 Ant Group 3 Nanjing University
Pseudocode	No	The paper describes algorithms and methodologies in prose and mathematical equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	We provide our code at https://anonymous.4open.science/r/REO-RL-803F. Please refer to Sec. F for the details on reasoning efficiency frontiers and reasoning efficiency gap, and Sec. C for the implementation details.
Open Datasets	Yes	We evaluate on three challenging mathematical reasoning benchmarks: AMC 2023, AIME 2024, and AIME 2025. More training details can be found in Sec. 6.1 and Appendix C. We conducted large-scale RL experiments spanning 8 algorithms and 15 training configurations, resulting in 180+ and 210+ models for 1.5B and 7B scales, respectively. For training, we adopt a mixture of training data consisting of 135k problems sourced from DeepScaleR [Luo et al., 2025c] and AReaL [RL Lab, 2025].
Dataset Splits	No	The paper specifies the training data as a mixture of problems from DeepScaleR and AReaL, and evaluation on AMC 2023, AIME 2024, and AIME 2025 benchmarks. However, it does not provide specific percentages or counts for train/validation/test splits within any single dataset.
Hardware Specification	Yes	Cluster Config 8 × 8 H800 (for 1.5B) / 16 × 8 H800 (for 7B)
Software Dependencies	No	We implement the training algorithm with the AReaL framework [RL Lab, 2025], which supports SGLang [?] for rollout generation. However, no specific version numbers for these frameworks or other key software libraries are provided.
Experiment Setup	Yes	The default training setting and hyperparameters for PPO training are listed in Tab. 4. The default training configurations and hyperparameters for SFT are listed in Tab. 5. The default training configurations and hyperparameters for SimPO are listed in Tab. 6.