Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

CPPO: Accelerating the Training of Group Relative Policy Optimization-Based Reasoning Models

Authors: Zhihang Lin, Mingbao Lin, Yuan Xie, Rongrong Ji

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that CPPO achieves up to 7.98× speedup on GSM8K and 3.48× on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
Researcher Affiliation | Collaboration | Zhihang Lin (1,2), Mingbao Lin (3), Yuan Xie (2,4), Rongrong Ji (1). (1) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China; (2) Shanghai Innovation Institute, China; (3) Rakuten, Singapore; (4) East China Normal University, Shanghai, China.
Pseudocode | Yes | We provide the algorithm for our Completion Pruning Policy Optimization (CPPO) in Algorithm 1. For dynamic completion allocation, we adopt a more efficient implementation. Specifically, in Sec. 3.4 of the main paper, we describe completion allocation after completion pruning to provide a more intuitive explanation of our method. However, in the algorithm and our code, we perform completion allocation before completion pruning. Algorithm 1: Completions Pruning Policy Optimization.
Open Source Code | Yes | Experiments show that CPPO achieves up to 7.98× speedup on GSM8K and 3.48× on Math while preserving or even enhancing the accuracy compared to the original GRPO. We release our code at https://github.com/lzhxmu/CPPO.
Open Datasets | Yes | We conduct an ablation study on GSM8K [4] using Qwen2.5-1.5B-Instruct [29]. We evaluate the performance on multiple benchmarks with different difficulties, including Math [8], AIME2024 [18], AMC2023 [17], and GSM8K [4].
Dataset Splits | Yes | We train Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct on math datasets including Math [8] and GSM8K [4]. The results demonstrate that CPPO achieves up to 7.98× speedup on GSM8K and 3.48× on Math while preserving or even enhancing the accuracy compared to the original GRPO. Each model is evaluated on the corresponding test subset. To further assess out-of-distribution reasoning ability, we test Qwen2.5-7B-Instruct on AMC2023 [17] and AIME2024 [18], as these benchmarks are too difficult for Qwen2.5-1.5B-Instruct. We train Qwen2.5-1.5B-Instruct on the GSM8K training subset three times independently to calculate the mean and standard deviation.
Hardware Specification | Yes | Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct are trained on two and four GPUs (each with 80GB memory), respectively.
Software Dependencies | No | We implement CPPO on the Open R1 [5] and verl [22] frameworks, utilizing the vLLM inference library [14] for efficient completion generation.
Experiment Setup | Yes | We set ϵ = 0.2 and β = 0.04 in Eq. (7), batch size to 16, number of epochs to 1, and learning rate to 1 × 10⁻⁶. The policy model temperature is 1, group size is 16, and the maximum completion length is 1024.
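
For anyone attempting to reproduce the run, the settings quoted in the Experiment Setup row can be gathered into a single configuration object. The following is a minimal sketch only: the class name `CPPOExperimentSetup` and all field names are hypothetical and are not taken from the authors' code or from the Open R1 or verl frameworks. Only the values come from the row above, with the learning rate read as 1e-6.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class CPPOExperimentSetup:
    """Hypothetical container for the reported hyperparameters.

    Field names are illustrative assumptions; values are the ones
    quoted in the Experiment Setup row.
    """
    epsilon: float = 0.2               # ϵ in Eq. (7)
    beta: float = 0.04                 # β in Eq. (7)
    batch_size: int = 16
    num_epochs: int = 1
    learning_rate: float = 1e-6        # read as 1 × 10^-6
    temperature: float = 1.0           # policy model temperature
    group_size: int = 16
    max_completion_length: int = 1024


setup = CPPOExperimentSetup()
settings = asdict(setup)

# Print one "name = value" line per reported setting.
for name, value in settings.items():
    print(f"{name} = {value}")
```

Freezing the dataclass keeps the recorded values from being changed accidentally during a run; mapping these fields onto an actual training framework's arguments is left to the reader, since the source does not list them.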