Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning
Authors: Jiayu Wang, Yifei Ming, Zixuan Ke, Caiming Xiong, Shafiq Joty, Aws Albarghouthi, Frederic Sala
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present comprehensive empirical findings that reveal which aspects of reasoning are most enhanced by RL (e.g., flexibility in plan following and integrating knowledge into its reasoning processes), which remain challenging (e.g., robustness in solving subproblems), and the conditions under which RL provides the greatest benefits. |
| Researcher Affiliation | Collaboration | University of Wisconsin-Madison; Salesforce AI Research |
| Pseudocode | No | No explicit pseudocode or algorithm blocks are present in the paper. Appendix C provides a mathematical formula for the JGRPO objective, but it is not presented as structured pseudocode. |
| Open Source Code | Yes | Our code, data, and checkpoints are available at: https://sparkle-reasoning.github.io/. |
| Open Datasets | Yes | Our code, data, and checkpoints are available at: https://sparkle-reasoning.github.io/. SPARKLE is created from diverse mathematical problem benchmarks including AIME24 [32], AMC23 [31], MATH500 [17], GSM8K [5], and OlympiadBench [16] (test splits). |
| Dataset Splits | Yes | For Stage 1, we use the training set from DeepScaleR-Preview [29], which contains 40K math questions... To curate the training set for Stage 2, we first identify 6.5K most challenging problems... This results in a curated set of 5.7K difficult problems. ... SPARKLE is created from diverse mathematical problem benchmarks including AIME24 [32], AMC23 [31], MATH500 [17], GSM8K [5], and OlympiadBench [16] (test splits). |
| Hardware Specification | Yes | We conduct training and evaluation using 8 NVIDIA H200, 15 NVIDIA A100-PCIE-40GB and 9 NVIDIA A100-SXM4-40GB GPUs |
| Software Dependencies | Yes | Python 3.10, PyTorch 2.4.0, and Transformers 4.47.1. |
| Experiment Setup | Yes | We establish baseline model performance using a learning rate of 1e-6 and a KL loss coefficient of 0.001. For Stage 2, ... we maintain the same configurations from Stage 1, except for increasing the KL loss coefficient to 0.01 ... Throughout Stage 2, we use a sampling temperature of 0.6 and generate 32 samples per problem... During evaluation, we use a sampling temperature of 0.6 and a maximum context length of 16k tokens. |
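
The hyperparameters quoted in the Experiment Setup row can be gathered into one place for anyone attempting a rerun. The following is a minimal sketch, not the authors' configuration format: the class name `RLStageConfig`, its field names, and the `EVAL_SETTINGS` dictionary are hypothetical and chosen only for illustration. Values not stated in the quoted text (for example, the Stage 1 sampling temperature) are left as `None` rather than guessed, and the context length is kept as the reported string "16k" because the exact token count is not given.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RLStageConfig:
    """Hypothetical container for the per-stage values quoted in the table."""
    learning_rate: float
    kl_loss_coef: float
    sampling_temperature: Optional[float] = None  # None = not stated in the quoted text
    samples_per_problem: Optional[int] = None     # None = not stated in the quoted text


# Stage 1 (baseline): learning rate 1e-6, KL loss coefficient 0.001.
STAGE1 = RLStageConfig(learning_rate=1e-6, kl_loss_coef=0.001)

# Stage 2: same configuration as Stage 1 except the KL loss coefficient (0.01),
# with a sampling temperature of 0.6 and 32 samples per problem.
STAGE2 = replace(
    STAGE1,
    kl_loss_coef=0.01,
    sampling_temperature=0.6,
    samples_per_problem=32,
)

# Evaluation: temperature 0.6; maximum context length kept as reported.
EVAL_SETTINGS = {"sampling_temperature": 0.6, "max_context_length": "16k tokens"}

if __name__ == "__main__":
    print(STAGE1)
    print(STAGE2)
    print(EVAL_SETTINGS)
```

Deriving `STAGE2` from `STAGE1` with `dataclasses.replace` mirrors the wording of the response, which says Stage 2 keeps the Stage 1 configuration apart from the listed changes.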