Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
GVPO: Group Variance Policy Optimization for Large Language Model Post-Training
Authors: Kaichen Zhang, Yuzhong Hong, Junwei Bao, Hongfei Jiang, Yang Song, Hong Dingqian, Hui Xiong
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive evaluation on math reasoning. Specifically, we post-train the Qwen2.5-Math-7B model on Competition Math dataset [14] and assess performance on AIME2024 [26], AMC [26], Math500 [15], Minerva [25], and OlympiadBench [13]. For answer verification, we utilize the xVerify framework [6]. We adopt the pass@1 accuracy for all benchmarks except AIME2024, where we report avg@32 accuracy to account for its limited size (30 problems) and high difficulty. In addition to math reasoning, we also evaluate GVPO on the summarization task in Appendix C. Table 1 shows the main experiment result, which demonstrates that GVPO achieves the best performance, outperforming both the base model and other variants in all benchmarks. |
| Researcher Affiliation | Collaboration | 1Thrust of Artificial Intelligence, Hong Kong University of Science and Technology (Guangzhou) 2Zuoyebang Education Technology 3Department of Computer Science and Engineering, HKUST |
| Pseudocode | Yes | Algorithm 1: Group Variance Policy Optimization. Require: initial policy πθ; prompt distribution D; hyperparameter β. 1: for step = 1, ..., n do 2: Sample a batch D_b from D 3: Update the old policy model πθold ← πθ 4: Sample k responses {y_i}_{i=1}^{k} ∼ πs(· ∣ x) for each prompt x ∈ D_b 5: Compute rewards {R(x, y_i)}_{i=1}^{k} for every sampled response y_i and prompt x 6: Iteratively update policy πθ by minimizing the GVPO loss (Equation 8, setting πθ = πθold) 7: end for 8: Return πθ |
| Open Source Code | Yes | Corresponding Authors. Code available at https://github.com/jszkc/GVPO |
| Open Datasets | Yes | Specifically, we post-train the Qwen2.5-Math-7B model on Competition Math dataset [14] and assess performance on AIME2024 [26], AMC [26], Math500 [15], Minerva [25], and OlympiadBench [13]. For answer verification, we utilize the xVerify framework [6]. We adopt the pass@1 accuracy for all benchmarks except AIME2024, where we report avg@32 accuracy to account for its limited size (30 problems) and high difficulty. In addition to math reasoning, we also evaluate GVPO on the summarization task in Appendix C. |
| Dataset Splits | No | For each step, we sample 1024 prompts from the training set and set the mini-batch size in each step to 256. We repeat the whole training set for 10 epochs and set the warm-up ratio to 5%. We grid-search the learning rate in {5e-7, 1e-6, 5e-6, 1e-5} and find 5e-6 to be the best setting. We conduct the main experiment using a DeepSeek-R1-like chat template on top of Qwen2.5-Math-7B as in [18]. |
| Hardware Specification | Yes | Compute Resources. We conduct our experiments using a server with eight 80GB H800 GPU cards. For Qwen2.5-Math-7B experiments with k = 5, it takes 6 to 8 minutes per training step and approximately 12 hours per experiment. For Qwen2.5-Math-1.5B experiments with k = 8, it takes 4 to 5 minutes per training step and approximately 8 hours per experiment. |
| Software Dependencies | No | It is easy to implement GVPO based on an open-source RL framework. For example, we show the minimum viable implementation of GVPO that only modifies a few lines of the GRPO loss in verl [40]: |
| Experiment Setup | Yes | Hyperparameter Recipe. For each step, we sample 1024 prompts from the training set and set the mini-batch size in each step to 256. We repeat the whole training set for 10 epochs and set the warm-up ratio to 5%. We grid-search the learning rate in {5e-7, 1e-6, 5e-6, 1e-5} and find 5e-6 to be the best setting. We conduct the main experiment using a DeepSeek-R1-like chat template on top of Qwen2.5-Math-7B as in [18]. In the ablation experiments, for faster training and GPU memory limitations, we use the original Qwen chat template on top of Qwen2.5-Math-1.5B. To ensure a fair comparison across methods, we maintain identical experimental settings while only modifying the algorithmic component. For GVPO, we employ β = 0.1 and πs = πθold in the main experiment. For competing approaches, we utilize hyperparameters specified in their original publications. All experiments generate k = 5 responses per prompt. A comprehensive description of the training details is provided in Appendix A.1. |
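
The quoted setup and hardware figures imply some simple per-step arithmetic, which the sketch below works through in Python. It is only an illustration, not the authors' configuration: the `Recipe` class, its field names, and the helper functions are hypothetical, and it assumes that the mini-batch size of 256 is counted in prompts and that the reported wall-clock time is spent entirely on training steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Values quoted in the table above; the field names are hypothetical."""
    prompts_per_step: int = 1024      # prompts sampled per training step
    mini_batch_size: int = 256        # assumed to be counted in prompts
    responses_per_prompt: int = 5     # k in the main experiment
    epochs: int = 10
    warmup_ratio: float = 0.05
    learning_rate: float = 5e-6       # best value found by the grid search
    beta: float = 0.1                 # GVPO hyperparameter


# Learning-rate grid quoted in the table.
LR_GRID = (5e-7, 1e-6, 5e-6, 1e-5)


def updates_per_step(recipe: Recipe) -> int:
    """Mini-batch updates per step, assuming mini-batches are counted in prompts."""
    return recipe.prompts_per_step // recipe.mini_batch_size


def responses_per_step(recipe: Recipe) -> int:
    """Sampled responses per step: one group of k responses per prompt."""
    return recipe.prompts_per_step * recipe.responses_per_prompt


def implied_step_range(total_hours: int, min_minutes: int, max_minutes: int) -> tuple[int, int]:
    """Rough (low, high) step count implied by total run time and per-step time.

    Assumes all wall-clock time goes to training steps, which ignores
    evaluation and checkpointing overhead.
    """
    total_minutes = total_hours * 60
    return total_minutes // max_minutes, total_minutes // min_minutes


if __name__ == "__main__":
    recipe = Recipe()
    print("mini-batch updates per step:", updates_per_step(recipe))
    print("responses sampled per step:", responses_per_step(recipe))
    # 7B run: 6 to 8 minutes per step, approximately 12 hours per experiment.
    print("implied steps (7B run):", implied_step_range(12, 6, 8))
```

Under these assumptions, each step performs 4 mini-batch updates over 5120 sampled responses, and the 7B timing figures correspond to roughly 90 to 120 training steps per experiment. The source does not state the step count, so that range is an estimate only.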