Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

Authors: Jiayu Wang, Yifei Ming, Zixuan Ke, Caiming Xiong, Shafiq Joty, Aws Albarghouthi, Frederic Sala

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We present comprehensive empirical findings that reveal which aspects of reasoning are most enhanced by RL (e.g., flexibility in plan following and integrating knowledge into its reasoning processes), which remain challenging (e.g., robustness in solving subproblems), and the conditions under which RL provides the greatest benefits.
Researcher Affiliation	Collaboration	University of Wisconsin-Madison; Salesforce AI Research
Pseudocode	No	No explicit pseudocode or algorithm blocks are present in the paper. Appendix C provides a mathematical formula for the J_GRPO objective, but it is not presented as structured pseudocode.
Open Source Code	Yes	Our code, data, and checkpoints are available at: https://sparkle-reasoning.github.io/.
Open Datasets	Yes	Our code, data, and checkpoints are available at: https://sparkle-reasoning.github.io/. SPARKLE is created from diverse mathematical problem benchmarks including AIME24 [32], AMC23 [31], MATH500 [17], GSM8K [5], and OlympiadBench [16] (test splits).
Dataset Splits	Yes	For Stage 1, we use the training set from DeepScaleR-Preview [29], which contains 40K math questions... To curate the training set for Stage 2, we first identify 6.5K most challenging problems... This results in a curated set of 5.7K difficult problems. ... SPARKLE is created from diverse mathematical problem benchmarks including AIME24 [32], AMC23 [31], MATH500 [17], GSM8K [5], and OlympiadBench [16] (test splits).
Hardware Specification	Yes	We conduct training and evaluation using 8 NVIDIA H200, 15 NVIDIA A100-PCIE-40GB and 9 NVIDIA A100-SXM4-40GB GPUs
Software Dependencies	Yes	Python 3.10, PyTorch 2.4.0, and Transformers 4.47.1.
Experiment Setup	Yes	We establish baseline model performance using a learning rate of 1e-6 and a KL loss coefficient of 0.001. For Stage 2, ... we maintain the same configurations from Stage 1, except for increasing the KL loss coefficient to 0.01 ... Throughout Stage 2, we use a sampling temperature of 0.6 and generate 32 samples per problem... During evaluation, we use a sampling temperature of 0.6 and a maximum context length of 16k tokens.
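
The reported setup values in the table above can be collected into a small configuration sketch. This is only an illustration, not the authors' code: the class names, field names, and the `check_versions` helper are hypothetical, the PyPI package names `torch` and `transformers` are assumed to correspond to the listed PyTorch and Transformers versions, and reading "16k tokens" as 16 * 1024 is an assumption. Values not quoted in the table are left as `None` rather than guessed.

```python
from dataclasses import dataclass, replace
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for the training values quoted in the table."""
    learning_rate: float
    kl_loss_coef: float
    sampling_temperature: Optional[float] = None  # None = not quoted for this stage
    samples_per_problem: Optional[int] = None     # None = not quoted for this stage


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical container for the evaluation values quoted in the table."""
    sampling_temperature: float = 0.6
    max_context_tokens: int = 16 * 1024  # assumption: "16k" read as 16 * 1024


# Stage 1: learning rate 1e-6, KL loss coefficient 0.001.
STAGE1 = StageConfig(learning_rate=1e-6, kl_loss_coef=0.001)

# Stage 2: same as Stage 1 except KL loss coefficient 0.01,
# with sampling temperature 0.6 and 32 samples per problem.
STAGE2 = replace(
    STAGE1,
    kl_loss_coef=0.01,
    sampling_temperature=0.6,
    samples_per_problem=32,
)

EVAL = EvalConfig()

# Library versions listed under Software Dependencies (Python 3.10 is listed too).
PINNED: Dict[str, str] = {"torch": "2.4.0", "transformers": "4.47.1"}


def check_versions(pinned: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each pinned package, or None if absent."""
    installed: Dict[str, Optional[str]] = {}
    for name in pinned:
        try:
            installed[name] = version(name)
        except PackageNotFoundError:
            installed[name] = None
    return installed


if __name__ == "__main__":
    print(STAGE1)
    print(STAGE2)
    print(EVAL)
    for name, found in check_versions(PINNED).items():
        print(f"{name}: pinned {PINNED[name]}, installed {found}")
```

Deriving `STAGE2` from `STAGE1` with `dataclasses.replace` mirrors the statement that Stage 2 keeps the Stage 1 configuration apart from the listed changes.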