Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Learning to Reason under Off-Policy Guidance

Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Compared with previous RLVR methods, LUFFY achieves an over +6.4 average gain across six math benchmarks and an advantage of over +6.2 points in out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.
Researcher Affiliation	Academia	1 Zhejiang University 2 Shanghai AI Laboratory 3 Westlake University 4 Nanjing University 5 The Chinese University of Hong Kong Corresponding to: EMAIL, EMAIL
Pseudocode	No	The paper includes mathematical formulations (Eq. 1-8) and theoretical proofs in the appendix, but no explicit pseudocode block or algorithm section is found.
Open Source Code	Yes	We provide our code and data in supplementary files.
Open Datasets	Yes	Our training set is a subset of OpenR1-Math-220k [28], of which the prompts are collected from NuminaMath 1.5 [29], and the off-policy reasoning traces are generated by DeepSeek-R1 [2]. [28] Hugging Face. Open r1: A fully open reproduction of deepseek-r1, January 2025. [29] Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Huang, Kashif Rasul, Longhui Yu, Albert Q. Jiang, Ziju Shen, et al. Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions. https://huggingface.co/datasets/Numinamath, 2024. Hugging Face repository, 13:9.
Dataset Splits	Yes	We conduct experiments using LLaMA-3.1-8B on two subsets of varying difficulty (Easy and Hard), with details provided in Appendix C. As shown in Figure 4, on-policy reinforcement learning performs well on the Easy subset but fails on the Hard subset, where training rewards collapse to zero, since on-policy rollouts struggle to obtain positive feedback signals. In contrast, LUFFY achieves stable reward improvements on both datasets, highlighting its robustness and its ability to overcome limitations imposed by model capacity. Appendix C.1: Easy and Hard Training Set: We first filter the questions for which DeepSeek-R1 can generate a correct answer. Then, we split the data according to the length of DeepSeek-R1's solution. We coin questions R1 can solve within 2k tokens as Easy set and those within 4k tokens as the Hard set, respectively. ... Finally, the Easy dataset contains 7.3k prompts, and the Hard dataset contains 25.4k prompts.
Hardware Specification	Yes	All training experiments are conducted using 8 A100 GPUs.
Software Dependencies	No	Our implementation is based on verl, which uses vLLM as the rollout generators. We are thankful for these open-source repositories.
Experiment Setup	Yes	We remove the KL loss term by setting β = 0 and set the entropy loss coefficient to 0.01. Following Dr. GRPO [6], we remove the length normalization and standard error normalization of GRPO loss (Eq. 3) for all experiments. For policy shaping, we empirically set the γ as 0.1 and study the value of γ in Appendix E.4. Our rollout batch size is 128, and the update batch size is 64. We use 8 rollouts per prompt. Specifically, for on-policy RL, we use 8 on-policy rollouts. For our methods, we use 1 off-policy and 7 on-policy rollouts to ensure fairness. We use temperature=1.0 for rollout generation. We use Math-Verify as our reward function and include no format or length reward. We use Qwen2.5-Math-7B [30] by default, following previous work [24, 5, 6]. In addition, we extend LUFFY to Qwen2.5-Math-1.5B [30] and Qwen2.5-Instruct-7B [31], and LLaMA 3.1-8B [32].
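
The values quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the authors' code and not the verl configuration schema: the class name, field names, and the helper `group_advantages` are hypothetical. The helper assumes that "removing standard error normalization" means advantages are mean-centered within a rollout group without dividing by the group's reward standard deviation, and that the off-policy rollout's reward is included in the same group.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SetupSketch:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    kl_coef: float = 0.0              # beta = 0 removes the KL loss term
    entropy_coef: float = 0.01        # entropy loss coefficient
    policy_shaping_gamma: float = 0.1
    rollout_batch_size: int = 128
    update_batch_size: int = 64
    rollouts_per_prompt: int = 8
    off_policy_rollouts: int = 1      # 1 for LUFFY; 0 for the on-policy baseline
    temperature: float = 1.0

    @property
    def on_policy_rollouts(self) -> int:
        # Remaining rollouts in each group are sampled from the current policy.
        return self.rollouts_per_prompt - self.off_policy_rollouts


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centered advantages for one prompt's rollout group.

    Assumption: no division by the group standard deviation, as an
    illustration of the removed normalization described in the row above.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

With the defaults, `SetupSketch().on_policy_rollouts` evaluates to 7, matching the 1 off-policy plus 7 on-policy split; passing `off_policy_rollouts=0` gives the 8-rollout on-policy baseline.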