Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models

Authors: Muzhi Dai, Chenxu Yang, Qingyi Si

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Empirical evaluations demonstrate that S-GRPO is compatible with state-of-the-art reasoning models, including Qwen3 and Deepseek-distill. Across diverse benchmarks such as GSM8K, AIME 2024, AMC 2023, MATH-500, and GPQA Diamond, S-GRPO achieves a substantial reduction in sequence length (40.4%–61.1%) while simultaneously improving accuracy (absolute 0.72%–3.92%).
Researcher Affiliation	Collaboration	Muzhi Dai¹, Chenxu Yang², Qingyi Si¹; ¹Huawei Technologies Co., Ltd.; ²Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; EMAIL, EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Serial-Group Decaying-Reward Policy Optimization (S-GRPO)
Open Source Code	No	We pioneer a serial-group RL paradigm that overcomes the critical limitation of outcome-reward RL in regulating intermediate reasoning processes, accompanied by an open-sourced training framework (released once accepted).
Open Datasets	Yes	Training datasets. We selected problems from DeepMath-103K [23] to build our training set. ... Benchmarks. To comprehensively assess the model's capabilities across a range of reasoning tasks, we select five popular benchmarks that reflect diverse levels of difficulty: GSM8K [16]... AIME 2024 [17]... AMC 2023 [18]... MATH-500 [19]... GPQA [20]...
Dataset Splits	No	The paper mentions evaluation benchmarks (GSM8K, AIME 2024, AMC 2023, MATH-500, and GPQA Diamond) and a training dataset (DeepMath-103K), but does not explain how these datasets are split into training, validation, or test sets. It states only the number of trials performed for evaluation runs.
Hardware Specification	No	In our experiments, 64 80g memory was used to train the models.
Software Dependencies	No	Across all experiments, we employ Adam [29] as the standard optimizer. No specific versions for software or libraries are provided.
Experiment Setup	Yes	For S-GRPO, we use a learning rate of 1 × 10⁻⁶ and randomly select 8 temporal positions for each query. Since we adopt an on-policy mode, the generation batch size and training batch size are both set to 128 × 8. For GRPO, we use the same learning rate and batch size settings. For RL + Length Penalty, we follow the settings described in its original paper [27] and set the scalar parameter α to 0.2.
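
The reported setup can be collected into a small configuration object for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code: the class and field names are hypothetical, the learning rate is taken to be 1e-6, and the batch size entry is assumed to mean 128 queries with 8 sequences each.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SGRPOSetup:
    """Hyperparameters quoted in the table above; all field names are hypothetical."""

    learning_rate: float = 1e-6        # assumed reading of the reported learning rate
    temporal_positions: int = 8        # temporal positions randomly selected per query
    queries_per_batch: int = 128       # assumption: the first factor of the batch size
    optimizer: str = "Adam"            # from the Software Dependencies row
    length_penalty_alpha: float = 0.2  # applies only to the RL + Length Penalty baseline

    @property
    def sequences_per_batch(self) -> int:
        # Assumption: the batch size is queries times sequences per query.
        return self.queries_per_batch * self.temporal_positions


setup = SGRPOSetup()
print(setup.sequences_per_batch)
```

Values the table does not report, such as library versions and the accelerator type, are deliberately left out of the sketch rather than guessed.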