Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
Authors: Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4× speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. |
| Researcher Affiliation | Collaboration | Jiarui Yao¹, Yifan Hao¹, Hanning Zhang¹, Hanze Dong², Wei Xiong¹, Nan Jiang¹, Tong Zhang¹; ¹University of Illinois Urbana-Champaign, ²Microsoft Research |
| Pseudocode | Yes | Algorithm 1 Meta Algorithm: GVM-EM ... Algorithm 2 GVM: Practical Implementation ... Algorithm 3 Rejection sampling |
| Open Source Code | No | We use existing open-source datasets as the training and test datasets, and will release our code repo for public access. |
| Open Datasets | Yes | We conduct experiments with Qwen2.5-Math-1.5B and Qwen2.5-Math-7B (Yang et al., 2024b). We focus on the mathematical reasoning task and use Math-Verify as the verifier... Table 1: Performance of different algorithms across five benchmarks including Math500 (Hendrycks et al., 2021), Minerva Math (Lewkowycz et al., 2022), Olympiad Bench (He et al., 2024), AIME24, and AMC23... For the training dataset, we use the Numina-Math (LI et al., 2024). |
| Dataset Splits | Yes | We consider mathematical reasoning with large language models (LLMs): given a prompt $x \in \mathcal{X}$, and aims to produce a correct final answer $z \in \mathcal{Z}$... We consider a training set $\mathcal{B} = \{(x_i, z_i)\}_{i=1}^{m}$ drawn from $d_0$ with $z_i$ being the labeled ground-truth answer to illustrate the idea... For the training dataset, we use the Numina-Math (LI et al., 2024). For the evaluation, we use five benchmarks including Math500 (Hendrycks et al., 2021), Minerva Math (Lewkowycz et al., 2022), Olympiad Bench (He et al., 2024), AIME24, and AMC23. |
| Hardware Specification | Yes | For compute resources, we mainly conduct the experiments on NVIDIA RTX A6000 and H100 GPUs, and each iteration in GVM typically takes 90 minutes with sample sizes N′ = 8, N = 8n on a 4×H100 GPU server. |
| Software Dependencies | No | We utilize verl (Sheng et al., 2024) as the training framework, and implement the RAFT++ as Xiong et al. (2025a) show that the additional importance sampling and clipping improve over the vanilla RAFT (Dong et al., 2023). |
| Experiment Setup | Yes | For each iteration, we use a prompt batch size of 1024, and use a mini-batch size 256 for gradient update. The max prompt length is set to be 1024, and the models are allowed to generate at most 3072 tokens so that they do not exceed the context window of 4096 tokens. There is no warmup stage and the learning rate is chosen to be a constant 1e-6... Table 2: Full hyperparameters: α = 1e-3, β = 2, batch size = 1024, mini batch size = 256, max prompt length = 1024, max response length = 3072, learning rate = 1e-6, KL loss coefficient = 0.001. |
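
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration object, which makes the stated constraints easy to check. The following is a minimal sketch, not the authors' code: the class name `GVMSetup`, its field names, and both helper methods are hypothetical and chosen only for illustration; the numeric values are the ones listed in the row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GVMSetup:
    """Values quoted in the Experiment Setup row; all names here are hypothetical."""
    alpha: float = 1e-3
    beta: float = 2.0
    batch_size: int = 1024            # prompts per iteration
    mini_batch_size: int = 256        # prompts per gradient update
    max_prompt_length: int = 1024     # tokens
    max_response_length: int = 3072   # tokens
    learning_rate: float = 1e-6       # constant, no warmup stage
    kl_loss_coef: float = 0.001
    context_window: int = 4096        # tokens

    def mini_batches_per_iteration(self) -> int:
        # Number of mini-batches needed to cover one prompt batch once.
        if self.batch_size % self.mini_batch_size != 0:
            raise ValueError("batch_size must be a multiple of mini_batch_size")
        return self.batch_size // self.mini_batch_size

    def fits_context_window(self) -> bool:
        # Prompt plus response must not exceed the context window.
        total = self.max_prompt_length + self.max_response_length
        return total <= self.context_window


setup = GVMSetup()
print(setup.mini_batches_per_iteration())  # 1024 // 256 = 4
print(setup.fits_context_window())         # 1024 + 3072 = 4096 <= 4096, so True
```

With the quoted values, one prompt batch splits into four mini-batches, and the prompt and response limits together exactly fill the 4096-token context window.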