Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Authors: Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
Researcher Affiliation | Industry | Equal contribution. Correspondence to: EMAIL
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present structured, step-by-step procedure blocks in such a format.
Open Source Code | Yes | We open-source model weights and data for reproducing our results at https://huggingface.co/collections/nvidia/acereason-682f4e1261dc22f697fd1485.
Open Datasets | Yes | We open-source the dataset for the benefit of the community at https://huggingface.co/datasets/nvidia/AceReason-Math
Dataset Splits | Yes | At the 24K response length RL stage, we build {Easy, Medium, Hard}-prompt sets based on difficulty estimated by 7B model's performance over 16 attempts. Easy prompt set includes prompts except those solved more than 14 times, Medium prompt set excludes those solved more than 10, and Hard excludes those solved more than 6.
Hardware Specification | Yes | All experiments are conducted using 128 NVIDIA H100 GPUs.
Software Dependencies | Yes | Specifically, it relies on antlr4-python3-runtime (v4.11.1) and sympy (v1.12). ... and employ the vLLM inference engine (v0.7.3) (Kwon et al., 2023) for sample generation.
Experiment Setup | Yes | We used a batch size of 128, sampling G = 8 responses per prompt for 8K length training and 16 responses otherwise. We adopted a learning rate of 1e-6 with AdamW (Kingma, 2014), and set both the entropy loss coefficient and KL loss coefficient β to 0.
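The difficulty-based prompt sets quoted in the Dataset Splits row can be illustrated with a minimal sketch. It assumes the 7B model's results are already summarized as a mapping from a prompt id to the number of its 16 sampled attempts judged correct; `build_prompt_sets`, `MAX_SOLVED`, and the prompt ids are hypothetical names, not the authors' code.

```python
from typing import Dict, List

NUM_ATTEMPTS = 16  # attempts per prompt by the 7B model, as quoted above

# Highest solve count a prompt may have and still stay in each set
# (a prompt is excluded once it is solved MORE than this many times).
MAX_SOLVED = {"easy": 14, "medium": 10, "hard": 6}


def build_prompt_sets(solve_counts: Dict[str, int]) -> Dict[str, List[str]]:
    """Split prompts into Easy/Medium/Hard sets by how often they were solved.

    solve_counts maps a (hypothetical) prompt id to the number of the
    NUM_ATTEMPTS sampled responses that were judged correct.
    """
    sets: Dict[str, List[str]] = {name: [] for name in MAX_SOLVED}
    for prompt_id, solved in solve_counts.items():
        if not 0 <= solved <= NUM_ATTEMPTS:
            raise ValueError(f"{prompt_id}: solve count {solved} out of range")
        for name, limit in MAX_SOLVED.items():
            if solved <= limit:
                sets[name].append(prompt_id)
    return sets


if __name__ == "__main__":
    counts = {"p1": 16, "p2": 12, "p3": 8, "p4": 3}
    for name, prompts in build_prompt_sets(counts).items():
        print(name, prompts)
    # easy ['p2', 'p3', 'p4']
    # medium ['p3', 'p4']
    # hard ['p4']
```

Because the thresholds shrink from 14 to 10 to 6, the sets are nested under this reading: every Hard prompt is also a Medium prompt, and every Medium prompt is also an Easy prompt.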
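When rebuilding an environment from the Software Dependencies row, a small check of the quoted version pins may be useful. The sketch below uses only the standard library's `importlib.metadata`. It assumes the PyPI distribution names `antlr4-python3-runtime`, `sympy`, and `vllm`, and a strict string match; `PINNED` and `check_pins` are hypothetical names.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple

# Versions quoted in the Software Dependencies row (assumed PyPI names).
PINNED = {
    "antlr4-python3-runtime": "4.11.1",
    "sympy": "1.12",
    "vllm": "0.7.3",
}


def check_pins(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (expected, installed)} for every unsatisfied pin.

    `installed` is None when the distribution is not installed at all.
    """
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, expected in pins.items():
        try:
            installed: Optional[str] = version(name)
        except PackageNotFoundError:
            installed = None
        # Strict string comparison: "1.12.0" would NOT match "1.12".
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    problems = check_pins(PINNED)
    if not problems:
        print("All pinned versions match.")
    for name, (expected, installed) in problems.items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

The exact-match rule is deliberately strict. Builds that append a local suffix to the version string (for example, a CUDA tag) will be reported as mismatches and would need a looser comparison.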
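The hyperparameters in the Experiment Setup row can be gathered into a single configuration object, as the sketch below shows. The field names, the representation of the response budget in thousands of tokens (8, 16, 24), and the loss convention are assumptions for illustration. The convention treats entropy as a bonus and KL as a penalty, and with both coefficients at 0 it leaves only the policy-gradient term.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RLConfig:
    """Hyperparameters from the Experiment Setup row; field names are hypothetical."""

    batch_size: int = 128        # prompts per training step
    learning_rate: float = 1e-6  # paired with the AdamW optimizer
    entropy_coef: float = 0.0    # entropy loss coefficient
    kl_coef: float = 0.0         # KL loss coefficient (beta)

    def group_size(self, max_len_k: int) -> int:
        """Responses G per prompt for a budget of max_len_k thousand tokens."""
        return 8 if max_len_k == 8 else 16

    def responses_per_step(self, max_len_k: int) -> int:
        """Total sampled responses per training step (prompts x G)."""
        return self.batch_size * self.group_size(max_len_k)

    def total_loss(self, pg_loss: float, entropy: float, kl: float) -> float:
        """Assumed convention: entropy is a bonus and KL is a penalty."""
        return pg_loss - self.entropy_coef * entropy + self.kl_coef * kl


if __name__ == "__main__":
    cfg = RLConfig()
    print(cfg.responses_per_step(8))   # 1024
    print(cfg.responses_per_step(24))  # 2048
    # With both coefficients at 0, only the policy-gradient term remains.
    print(cfg.total_loss(pg_loss=0.5, entropy=2.0, kl=3.0))  # 0.5
```

Freezing the dataclass keeps the quoted values from being mutated by accident during a run. A stage-specific variant can still be derived with `dataclasses.replace`.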