Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Angles Don’t Lie: Unlocking Training‑Efficient RL Through the Model’s Own Signals
Authors: Qinsi Wang, Jinghan Ke, Hancheng Ye, Yueqian Lin, Yuzhe Fu, Jianyi Zhang, Kurt Keutzer, Chenfeng Xu, Yiran Chen
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5× acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales. |
| Researcher Affiliation | Academia | Qinsi Wang¹, Jinghan Ke², Hancheng Ye¹, Yueqian Lin¹, Yuzhe Fu¹, Jianyi Zhang¹, Kurt Keutzer², Chenfeng Xu², Yiran Chen¹; ¹Duke University, ²University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1: GAIN-RL (GRPO): Gradient-driven Angle-Informed Navigated RL Framework |
| Open Source Code | Yes | Code is released at https://github.com/wangqinsi1/GAINRL/tree/main. |
| Open Datasets | Yes | For mathematical evaluations, we employed six benchmark datasets of varying difficulty: GSM8K [3], MATH [4], AMC 23 [18], AIME 24 [19], OlympiadBench [20], and Minerva Math [48]. For coding evaluations, we utilized three standard benchmark datasets: LiveCodeBench (8/1/24-2/1/25) [6], Codeforces [22], and HumanEval+ [23]. |
| Dataset Splits | Yes | For other experiments (Section 4.2 to Section 4.5), we train the model on the training dataset of single tasks, including GSM8K, MATH, and AMC 23, to facilitate more convenient comparisons. ... evaluated on their test sets. |
| Hardware Specification | Yes | For instance, the GRPO fine-tuning phase on Qwen2.5-7B (Ray + vLLM) still consumed roughly 240 GPU hours (16 H100-80 GB for 15 h) to complete only 100 steps over 8k samples [11]. ... The training was performed on a single node equipped with 8 A100 GPUs. |
| Software Dependencies | No | We trained the models using the GRPO algorithm. The training was performed on a single node equipped with 8 A100 GPUs. Each model was trained for about 200 steps using the veRL library. To evaluate the training efficiency on GRPO-RL, the main training configuration for Qwen2.5-Math-7B-Instruct is shown below. ... python3 -m verl.trainer.main_ppo ... |
| Experiment Setup | Yes | We set the target accuracy β = 0.5 to maintain strong gradients during training. Sensitivity parameters α = 2 (for accuracy) and γ = 0.5 (for angle concentration) are tuned on a validation set to ensure stable learning, keeping the tanh function approximately linear over Acc(t) ∈ [0, 1] and C(t) ∈ [−1, 1]. Training is conducted using GRPO with a batch size and sampling number n of 1024, implemented on the veRL framework with 8 NVIDIA A100 GPUs. Additional details are provided in the Appendix E. ... actor_rollout_ref.actor.optim.lr=1e-6 ... data.train_batch_size=1024 ... data.max_prompt_length=1024 ... data.max_response_length=8192 |
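
The Experiment Setup row says that α = 2, β = 0.5, and γ = 0.5 keep the tanh function approximately linear over the accuracy and angle-concentration ranges. The sketch below is one way to sanity-check that kind of claim numerically. It assumes the parameters enter as tanh(α·(Acc − β)) and tanh(γ·C); the summary above does not state the exact formula, so this form and the helper name `max_relative_deviation` are illustrative assumptions, not the authors' implementation.

```python
import math


def max_relative_deviation(scale, lo, hi, shift=0.0, steps=1001):
    """Largest relative gap between tanh(z) and z, with z = scale * (x - shift)
    and x sampled uniformly over [lo, hi]. Hypothetical helper for illustration."""
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        z = scale * (x - shift)
        if z == 0.0:
            continue  # tanh(0) == 0; the relative gap is undefined at this point
        worst = max(worst, abs(math.tanh(z) - z) / abs(z))
    return worst


# Values quoted in the Experiment Setup row.
ALPHA, BETA, GAMMA = 2.0, 0.5, 0.5

# ASSUMPTION: accuracy term is tanh(ALPHA * (Acc - BETA)) with Acc in [0, 1].
acc_dev = max_relative_deviation(ALPHA, 0.0, 1.0, shift=BETA)

# ASSUMPTION: concentration term is tanh(GAMMA * C) with C in [-1, 1].
conc_dev = max_relative_deviation(GAMMA, -1.0, 1.0)

print(f"accuracy term:      max relative deviation from linear = {acc_dev:.4f}")
print(f"concentration term: max relative deviation from linear = {conc_dev:.4f}")
```

Under this assumed form, the largest deviation from a straight line occurs at the ends of each range, so the check reduces to comparing tanh(1) with 1 and tanh(0.5) with 0.5. A different placement of α, β, or γ in the formula would give different numbers.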