Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models
Authors: Muzhi Dai, Chenxu Yang, Qingyi Si
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations demonstrate that S-GRPO is compatible with state-of-the-art reasoning models, including Qwen3 and Deepseek-distill. Across diverse benchmarks such as GSM8K, AIME 2024, AMC 2023, MATH-500, and GPQA Diamond, S-GRPO achieves a substantial reduction in sequence length (40.4%–61.1%) while simultaneously improving accuracy (absolute 0.72%–3.92%). |
| Researcher Affiliation | Collaboration | Muzhi Dai¹, Chenxu Yang², Qingyi Si¹. ¹Huawei Technologies Co., Ltd. ²Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China. EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 Serial-Group Decaying-Reward Policy Optimization (S-GRPO) |
| Open Source Code | No | We pioneer a serial-group RL paradigm that overcomes the critical limitation of outcome-reward RL in regulating intermediate reasoning processes, accompanied by an open-sourced training framework (released once accepted). |
| Open Datasets | Yes | Training datasets. We selected problems from DeepMath-103K [23] to build our training set. ... Benchmarks. To comprehensively assess the model's capabilities across a range of reasoning tasks, we select five popular benchmarks that reflect diverse levels of difficulty: GSM8K [16]... AIME 2024 [17]... AMC 2023 [18]... MATH-500 [19]... GPQA [20]... |
| Dataset Splits | No | The paper names its evaluation benchmarks (GSM8K, AIME 2024, AMC 2023, MATH-500, and GPQA Diamond) and its training dataset (DeepMath-103K), but does not state how these datasets are split into training, validation, or test sets. It reports only the number of trials performed for evaluation runs. |
| Hardware Specification | No | In our experiments, 64 80g memory was used to train the models. |
| Software Dependencies | No | Across all experiments, we employ Adam [29] as the standard optimizer. No specific versions for software or libraries are provided. |
| Experiment Setup | Yes | For S-GRPO, we use a learning rate of 1 × 10⁻⁶ and randomly select 8 temporal positions for each query. Since we adopt an on-policy mode, the generation batch size and training batch size are both set to 128 × 8. For GRPO, we use the same learning rate and batch size settings. For RL + Length Penalty, we follow the settings described in its original paper [27] and set the scalar parameter α to 0.2. |
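
The setup quoted in the Experiment Setup row can be collected into a small configuration object, which makes the batch-size arithmetic explicit. This is a minimal sketch, not the authors' training code: the class and field names are hypothetical, and reading "128 × 8" as 128 queries times 8 temporal positions per query is an assumption rather than something the quoted text states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SGRPOSetup:
    """Hypothetical container for the hyperparameters quoted above."""

    learning_rate: float = 1e-6            # "learning rate of 1 × 10⁻⁶"
    temporal_positions_per_query: int = 8  # "8 temporal positions for each query"
    queries_per_batch: int = 128           # assumed meaning of the "128" in "128 × 8"
    optimizer: str = "Adam"                # optimizer named in the Software Dependencies row
    length_penalty_alpha: float = 0.2      # used only by the RL + Length Penalty baseline

    @property
    def batch_size(self) -> int:
        # On-policy mode: generation and training batch sizes are the same value.
        return self.queries_per_batch * self.temporal_positions_per_query


setup = SGRPOSetup()
print(setup.batch_size)
```

Under that assumed reading, the generation and training batch size comes to 1024 sequences per step.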