Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Angles Don’t Lie: Unlocking Training‑Efficient RL Through the Model’s Own Signals

Authors: Qinsi Wang, Jinghan Ke, Hancheng Ye, Yueqian Lin, Yuzhe Fu, Jianyi Zhang, Kurt Keutzer, Chenfeng Xu, Yiran Chen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations show that GAIN-RL (GRPO) achieves over a 2.5× acceleration in training efficiency across diverse mathematical and coding tasks and varying model scales.
Researcher Affiliation | Academia | Qinsi Wang¹, Jinghan Ke², Hancheng Ye¹, Yueqian Lin¹, Yuzhe Fu¹, Jianyi Zhang¹, Kurt Keutzer², Chenfeng Xu², Yiran Chen¹; ¹Duke University, ²University of California, Berkeley
Pseudocode | Yes | Algorithm 1: GAIN-RL (GRPO): Gradient-driven Angle-Informed Navigated RL Framework
Open Source Code | Yes | Code is released at https://github.com/wangqinsi1/GAINRL/tree/main.
Open Datasets | Yes | For mathematical evaluations, we employed six benchmark datasets of varying difficulty: GSM8K [3], MATH [4], AMC 23 [18], AIME 24 [19], OlympiadBench [20], and Minerva Math [48]. For coding evaluations, we utilized three standard benchmark datasets: LiveCodeBench (8/1/24-2/1/25) [6], Codeforces [22], and HumanEval+ [23].
Dataset Splits | Yes | For other experiments (Section 4.2-Section 4.5), we train the model on the training dataset of single tasks, including GSM8K, MATH, and AMC 23, to facilitate more convenient comparisons. ... evaluated on their test sets.
Hardware Specification | Yes | For instance, the GRPO fine-tuning phase on Qwen2.5-7B (Ray + vLLM) still consumed roughly 240 GPU hours (16 H100-80GB for 15 h) to complete only 100 steps over 8k samples [11]. ... The training was performed on a single node equipped with 8 A100 GPUs.
Software Dependencies | No | We trained the models using the GRPO algorithm. The training was performed on a single node equipped with 8 A100 GPUs. Each model was trained for about 200 steps using the veRL library. To evaluate the training efficiency on GRPO-RL, the main training configuration for Qwen2.5-Math-7B-Instruct is shown below. ... python3 -m verl.trainer.main_ppo ...
Experiment Setup | Yes | We set the target accuracy β = 0.5 to maintain strong gradients during training. Sensitivity parameters α = 2 (for accuracy) and γ = 0.5 (for angle concentration) are tuned on a validation set to ensure stable learning, keeping the tanh function approximately linear over Acc(t) ∈ [0, 1] and C(t) ∈ [−1, 1]. Training is conducted using GRPO with a batch size and sampling number n of 1024, implemented on the veRL framework with 8 NVIDIA A100 GPUs. Additional details are provided in Appendix E. ... actor_rollout_ref.actor.optim.lr=1e-6 ... data.train_batch_size=1024 ... data.max_prompt_length=1024 ... data.max_response_length=8192
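
The figures quoted in the Hardware Specification and Experiment Setup rows can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code. It assumes the accuracy term has the form tanh(α(Acc(t) − β)) and the angle term the form tanh(γ·C(t)); the table quotes only the parameter values and input ranges, so these functional forms are an assumption. The helper names `tanh_argument_range` and `max_linear_gap` are hypothetical.

```python
import math

# Values quoted in the table rows above.
ALPHA, BETA, GAMMA = 2.0, 0.5, 0.5          # sensitivity / target-accuracy parameters
H100_COUNT, H100_HOURS = 16, 15             # cited GRPO fine-tuning run
MAX_PROMPT_LEN, MAX_RESPONSE_LEN = 1024, 8192


def tanh_argument_range(scale, lo, hi, shift=0.0):
    """Range of scale * (x - shift) for x in [lo, hi]."""
    a, b = scale * (lo - shift), scale * (hi - shift)
    return min(a, b), max(a, b)


def max_linear_gap(arg_lo, arg_hi, steps=1000):
    """Largest |tanh(z) - z| on an evenly spaced grid over [arg_lo, arg_hi]."""
    grid = (arg_lo + (arg_hi - arg_lo) * i / steps for i in range(steps + 1))
    return max(abs(math.tanh(z) - z) for z in grid)


# ASSUMPTION: accuracy term is tanh(ALPHA * (Acc - BETA)) with Acc in [0, 1];
# angle term is tanh(GAMMA * C) with C in [-1, 1].
acc_range = tanh_argument_range(ALPHA, 0.0, 1.0, shift=BETA)
angle_range = tanh_argument_range(GAMMA, -1.0, 1.0)

acc_gap = max_linear_gap(*acc_range)
angle_gap = max_linear_gap(*angle_range)

gpu_hours = H100_COUNT * H100_HOURS                       # GPUs x wall-clock hours
max_sequence_tokens = MAX_PROMPT_LEN + MAX_RESPONSE_LEN   # prompt + response budget

print(f"accuracy tanh argument range: {acc_range}, max gap from linear: {acc_gap:.4f}")
print(f"angle tanh argument range:    {angle_range}, max gap from linear: {angle_gap:.4f}")
print(f"GPU hours: {gpu_hours}, max tokens per sequence: {max_sequence_tokens}")
```

Under these assumptions the tanh arguments stay within [−1, 1] for the accuracy term and [−0.5, 0.5] for the angle term, so the angle term is close to linear throughout, while the accuracy term departs from linearity most at Acc(t) = 0 and Acc(t) = 1. The GPU-hour product matches the "roughly 240 GPU hours" quoted in the Hardware Specification row.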