Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Learning to Reason under Off-Policy Guidance
Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Compared with previous RLVR methods, LUFFY achieves an over +6.4 average gain across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR. |
| Researcher Affiliation | Academia | 1 Zhejiang University 2 Shanghai AI Laboratory 3 Westlake University 4 Nanjing University 5 The Chinese University of Hong Kong Corresponding to: EMAIL, EMAIL |
| Pseudocode | No | The paper includes mathematical formulations (Eq. 1-8) and theoretical proofs in the appendix, but no explicit pseudocode block or algorithm section is found. |
| Open Source Code | Yes | We provide our code and data in supplementary files. |
| Open Datasets | Yes | Our training set is a subset of OpenR1-Math-220k [28], of which the prompts are collected from NuminaMath 1.5 [29], and the off-policy reasoning traces are generated by DeepSeek-R1 [2]. [28] Hugging Face. Open R1: A fully open reproduction of DeepSeek-R1, January 2025. [29] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, et al. NuminaMath: The largest public dataset in AI4Maths with 860k pairs of competition math problems and solutions. https://huggingface.co/datasets/Numinamath, 2024. Hugging Face repository, 13:9. |
| Dataset Splits | Yes | We conduct experiments using LLaMA-3.1-8B on two subsets of varying difficulty (Easy and Hard), with details provided in Appendix C. As shown in Figure 4, on-policy reinforcement learning performs well on the Easy subset but fails on the Hard subset, where training rewards collapse to zero, since on-policy rollouts struggle to obtain positive feedback signals. In contrast, LUFFY achieves stable reward improvements on both datasets, highlighting its robustness and its ability to overcome limitations imposed by model capacity. Appendix C.1: Easy and Hard Training Set: We first filter the questions for which DeepSeek-R1 can generate a correct answer. Then, we split the data according to the length of DeepSeek-R1's solution. We coin questions R1 can solve within 2k tokens as Easy set and those within 4k tokens as the Hard set, respectively. ... Finally, the Easy dataset contains 7.3k prompts, and the Hard dataset contains 25.4k prompts. |
| Hardware Specification | Yes | All training experiments are conducted using 8 A100 GPUs. |
| Software Dependencies | No | Our implementation is based on verl, which uses vLLM as the rollout generators. We are thankful for these open-source repositories. |
| Experiment Setup | Yes | We remove the KL loss term by setting β = 0 and set the entropy loss coefficient to 0.01. Following Dr. GRPO [6], we remove the length normalization and standard error normalization of GRPO loss (Eq. 3) for all experiments. For policy shaping, we empirically set the γ as 0.1 and study the value of γ in Appendix E.4. Our rollout batch size is 128, and the update batch size is 64. We use 8 rollouts per prompt. Specifically, for on-policy RL, we use 8 on-policy rollouts. For our methods, we use 1 off-policy and 7 on-policy rollouts to ensure fairness. We use temperature=1.0 for rollout generation. We use Math-Verify as our reward function and include no format or length reward. We use Qwen2.5-Math-7B [30] by default, following previous work [24, 5, 6]. In addition, we extend LUFFY to Qwen2.5-Math-1.5B [30] and Qwen2.5-Instruct-7B [31], and LLaMA 3.1-8B [32]. |
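
The values quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' code or the verl configuration schema: the class name, field names, and the `centered_advantages` helper are hypothetical and chosen only for illustration. The helper assumes that dropping standard-deviation normalization leaves each rollout's advantage as its reward minus the group mean, which is one reading of the quoted setup.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class SetupSketch:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    kl_coef: float = 0.0             # beta = 0, i.e. the KL loss term is removed
    entropy_coef: float = 0.01       # entropy loss coefficient
    policy_shaping_gamma: float = 0.1
    rollout_batch_size: int = 128
    update_batch_size: int = 64
    rollouts_per_prompt: int = 8
    off_policy_rollouts: int = 1     # LUFFY setting; 0 for the on-policy baseline
    temperature: float = 1.0

    @property
    def on_policy_rollouts(self) -> int:
        # Remaining rollouts in each group are sampled from the current policy.
        return self.rollouts_per_prompt - self.off_policy_rollouts


def centered_advantages(rewards: list[float]) -> list[float]:
    """Assumed advantage form: reward minus group mean, with no std normalization."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


cfg = SetupSketch()
# Illustrative group: one correct off-policy trace, seven incorrect on-policy rollouts.
rewards = [1.0] + [0.0] * cfg.on_policy_rollouts
advantages = centered_advantages(rewards)
```

In this illustrative group the off-policy trace is the only rollout with a positive advantage, which matches the summary's point that off-policy guidance can supply a learning signal when on-policy rollouts all fail.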