Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Agentic RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving

Authors: Xinji Mai, Haotian Xu, Xing W, Weinong Wang, Yingying Zhang, Wenqiang Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We implement a robust framework featuring a decoupled code execution environment and validate our findings across standard RL algorithms and frameworks. Experiments show Zero TIR significantly surpasses non-tool Zero RL baselines on challenging math benchmarks. Our experiments primarily utilize the Qwen 2.5 Base 7B/32B model, starting directly from pre-trained weights to align with the Zero RL philosophy. We implement our Zero TIR approach using standard community frameworks Open RLHF and Open-Reasoner-Zero, and evaluate key RL algorithms including PPO and Reinforce++. The training dataset consists of ORZ-57k[22] and deepmath[23] dataset containing verifiable mathematical problems. We call the model trained in this way ZTRL. Model performance is evaluated on a suite of standard mathematical reasoning benchmarks such as MATH500[24], AIME24/25[25, 26], HMMT Feb. 24/25[27], cmimc[28], olymemath[29] and so on, which are some of the most difficult mathematical data sets out there.
Researcher Affiliation	Collaboration	Xinji Mai1,2, Haotian Xu2, Xing W2, Weinong Wang2, Yingying Zhang3, Wenqiang Zhang1,4; 1College of Intelligent Robotics and Advanced Manufacturing, Fudan University; 2Xiaohongshu; 3East China Normal University; 4Shanghai Key Lab of Intelligent Information Processing, College of Computer Science and Artificial Intelligence, Fudan University; EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Zero TIR Rollout with Spontaneous Code Calls Require: policy π, prompt P, code env E, call budget N
Open Source Code	Yes	Code is released at https://github.com/yyht/openrlhf_async_pipline.
Open Datasets	Yes	The training dataset consists of ORZ-57k[22] and deepmath[23] dataset containing verifiable mathematical problems. We call the model trained in this way ZTRL. Model performance is evaluated on a suite of standard mathematical reasoning benchmarks such as MATH500[24], AIME24/25[25, 26], HMMT Feb. 24/25[27], cmimc[28], olymemath[29] and so on, which are some of the most difficult mathematical data sets out there.
Dataset Splits	No	The paper mentions using ORZ-57k and deepmath datasets for training and evaluates on standard mathematical reasoning benchmarks like MATH500, AIME24/25, etc. However, it does not explicitly provide specific train/test/validation splits (e.g., percentages, sample counts, or citations to predefined splits) for these datasets within the paper's text. While evaluation benchmarks often have standard splits, the paper does not detail them for its own experimental setup.
Hardware Specification	No	The paper mentions using Qwen 2.5 Base 7B/32B models but does not specify the particular hardware (e.g., GPU models, CPU types, memory) used for running the experiments or training the models. It only generally refers to 'standard web serving technologies' for the decoupled code execution environment without hardware specifics.
Software Dependencies	No	The paper states, 'We implement our Zero TIR approach using standard community frameworks Open RLHF and Open-Reasoner-Zero, and evaluate key RL algorithms including PPO and Reinforce++.' While it names frameworks and algorithms, it does not provide specific version numbers for these software components or for other mentioned tools like Flask, Gunicorn, Nginx, or aiolimiter.
Experiment Setup	Yes	Key RL hyperparameters include a rollout batch size of 128, with 16 samples generated per prompt. We use 1 policy update step and 12 critic update steps per iteration. Micro-batch sizes for training and forward passes are set to 1. Stability and efficiency techniques, including group-accuracy replay buffer filtering and dynamic stop-token based interaction (detailed in Section 3.2), are employed. The decoupled code execution environment (Section 3.3) handles all tool calls. For initial scaling law validation experiments, the maximum tool calls per trajectory were limited (Nmax = 20) for efficiency. The evaluation metrics include greedy decoding (temperature=0), majority voting, pass@k, and the final performance measured under different top-p sampling settings (temperature=1).
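
Note on the evaluation metrics: the Experiment Setup row names majority voting and pass@k but the quoted text does not say how they are computed. The sketch below shows one common way to compute both from a set of sampled answers for a single problem. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `majority_vote` and `pass_at_k` are hypothetical, and the pass@k formula is the widely used unbiased combinatorial estimator, which the paper may or may not use.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent final answer among sampled completions.

    Assumes answers are already normalized to comparable strings.
    Ties are broken in favor of the answer seen first.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


def pass_at_k(n, c, k):
    """Estimate pass@k from n samples of which c are correct.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random
    subset of k samples contains at least one correct sample.
    This estimator is an assumption, not taken from the paper.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 16 samples per prompt matches the rollout setting quoted above;
    # the count of 4 correct samples is made up for illustration.
    print(majority_vote(["42", "41", "42"]))
    print(pass_at_k(n=16, c=4, k=1))
```

With 16 samples and 4 correct, pass@1 reduces to the plain fraction of correct samples (4/16), which is a quick sanity check on the estimator.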