Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning
Authors: Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. |
| Researcher Affiliation | Industry | Equal contribution. Correspondence to: EMAIL |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured, step-by-step procedure blocks in such a format. |
| Open Source Code | Yes | We open-source model weights and data for reproducing our results at https://huggingface.co/collections/nvidia/acereason-682f4e1261dc22f697fd1485. |
| Open Datasets | Yes | We open-source the dataset for the benefit of the community at https://huggingface.co/datasets/nvidia/AceReason-Math |
| Dataset Splits | Yes | At the 24K response length RL stage, we build {Easy, Medium, Hard}-prompt sets based on difficulty estimated by 7B model's performance over 16 attempts. Easy prompt set includes prompts except those solved more than 14 times, Medium prompt set excludes those solved more than 10, and Hard excludes those solved more than 6. |
| Hardware Specification | Yes | All experiments are conducted using 128 NVIDIA H100 GPUs. |
| Software Dependencies | Yes | Specifically, it relies on antlr4-python3-runtime (v4.11.1) and sympy (v1.12). ... and employ the vLLM inference engine (v0.7.3) (Kwon et al., 2023) for sample generation. |
| Experiment Setup | Yes | We used a batch size of 128, sampling G = 8 responses per prompt for 8K length training and 16 responses otherwise. We adopted a learning rate of 1e-6 with AdamW (Kingma, 2014), and set both the entropy loss coefficient and KL loss coefficient β to 0. |
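
The difficulty-based prompt sets quoted in the Dataset Splits row can be illustrated with a minimal sketch. This is not the authors' code: the function name `build_prompt_sets`, the `solve_counts` input format, and the prompt IDs are hypothetical, and only the thresholds (14, 10, 6) and the 16 attempts come from the quoted text.

```python
from typing import Dict, List

# Thresholds taken from the Dataset Splits row: a prompt is dropped from a set
# when the 7B model solved it more than this many times out of 16 attempts.
NUM_ATTEMPTS = 16
MAX_SOLVED = {"easy": 14, "medium": 10, "hard": 6}


def build_prompt_sets(solve_counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Group prompt IDs into Easy/Medium/Hard sets by solve count.

    `solve_counts` maps a (hypothetical) prompt ID to the number of correct
    responses out of NUM_ATTEMPTS. The input format is an assumption.
    """
    for prompt_id, solved in solve_counts.items():
        if not 0 <= solved <= NUM_ATTEMPTS:
            raise ValueError(f"{prompt_id}: solve count {solved} out of range")

    # Keep a prompt in a set only if it was solved at most `limit` times.
    return {
        name: sorted(pid for pid, solved in solve_counts.items() if solved <= limit)
        for name, limit in MAX_SOLVED.items()
    }


if __name__ == "__main__":
    counts = {"p_a": 0, "p_b": 7, "p_c": 11, "p_d": 16}
    for name, prompts in build_prompt_sets(counts).items():
        print(name, prompts)
```

Under this reading the three sets are nested (Hard ⊆ Medium ⊆ Easy), since each stricter threshold only removes more prompts that the model already solves frequently.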