Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs

Authors: Yi Hu, Shijia Kang, Haotong Yang, Haotian Xu, Muhan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	As our first contribution, we construct a large length generalization dataset containing 86 tasks spanning code execution, number processing, symbolic and logical reasoning, beyond the common addition or multiplication tasks. Secondly, we show that cross-task length generalization is possible with Meta-RFFT: after training on a large number of tasks and instances, the models achieve remarkable length generalization ability on unseen tasks with minimal fine-tuning or one-shot prompting. For example, after fine-tuning on 1 to 5 digit addition, our 32B model achieves 95% accuracy on 30 digit addition, significantly outperforming the state-of-the-art reasoning models (DeepSeek-R1-671B: 72%; QwQ-32B: 32%), despite never seeing this task during RF-pretraining. Our code is available at https://github.com/MuLabPKU/Meta-RFFT.
Researcher Affiliation	Collaboration	Yi Hu1,* Shijia Kang1,* Haotong Yang1 Haotian Xu2 Muhan Zhang1, 1Institute for Artificial Intelligence, Peking University 2Xiaohongshu Inc.
Pseudocode	Yes	For each task, we manually annotate the code (or pseudo-code) for each task as its rule, as well as a template script that can generate a detailed trajectory process for rule-following for each question.
Open Source Code	Yes	Our code is available at https://github.com/MuLabPKU/Meta-RFFT.
Open Datasets	Yes	As our first contribution, we construct a large length generalization dataset containing 86 tasks spanning code execution, number processing, symbolic and logical reasoning, beyond the common addition or multiplication tasks. Our data sources are as follows: LeetCode Problems (https://leetcode.com/problemset). NUPA. NUPA is a benchmark designed to assess the basic number processing capabilities of LLMs [52]. Big-Bench Hard. The benchmark includes reasoning tasks considered challenging for LLMs [42]. Symbolic Reasoning. We select coin flip and last letter concatenation from Wei et al. [48].
Dataset Splits	Yes	For each task, 300 rule-following samples are generated for each length from 1 to 15, resulting in approximately 310k samples in total. We experiment on models of two different sizes: Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct [39]. In the downstream adaptation stage, we evaluate the models on 4 NUPA tasks and 8 LeetCode tasks of appropriate difficulty and practical significance respectively. The description of each downstream task is provided in Appendix C.3. We first train the models on data of lengths from 1 to 5 and then test their performance on out-of-distribution (OOD) lengths from 6 to 30 to evaluate the length generalization performance. For each task, we generate 1,000 samples for each length from 1 to 5, resulting in a total of 5k training samples. Both the 7B and 32B models are fine-tuned through PiSSA in the downstream adaptation stage. We evaluate models on 100 samples per length per task.
Hardware Specification	Yes	We conduct all the experiments on NVIDIA H800 Tensor Core GPUs.
Software Dependencies	No	The paper mentions using Python scripts for data annotation, but does not provide specific version numbers for Python or any libraries used, which is required for a reproducible description of software dependencies.
Experiment Setup	Yes	As introduced in Section 3.1, our Meta-RFFT involves RF-pretraining and downstream adaptation. As shown in Table 1, in the RF-pretraining stage, we fine-tune the model on 74 tasks, aiming to develop a model that can strictly follow rules across multiple tasks and potentially transfer this capability to new tasks. For each task, 300 rule-following samples are generated for each length from 1 to 15, resulting in approximately 310k samples in total. We experiment on models of two different sizes: Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct [39]. The 7B model is fine-tuned with full-parameter training, while the 32B model is fine-tuned with PiSSA [29]. In the downstream adaptation stage, we evaluate the models on 4 NUPA tasks and 8 LeetCode tasks of appropriate difficulty and practical significance respectively. The description of each downstream task is provided in Appendix C.3. We first train the models on data of lengths from 1 to 5 and then test their performance on out-of-distribution (OOD) lengths from 6 to 30 to evaluate the length generalization performance. For each task, we generate 1,000 samples for each length from 1 to 5, resulting in a total of 5k training samples. Both the 7B and 32B models are fine-tuned through PiSSA in the downstream adaptation stage. We evaluate models on 100 samples per length per task. Detailed training hyperparameters are provided in Appendix D.
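
The downstream split described in the Dataset Splits and Experiment Setup rows (train on lengths 1 to 5 with 1,000 samples per length, evaluate on OOD lengths 6 to 30 with 100 samples per length) can be illustrated with a minimal sketch. This is not the authors' pipeline: the addition task stands in for a single downstream task, "length" is assumed here to mean the number of digits in each operand, and every function name, the sample format, and the seeds are hypothetical choices for illustration.

```python
import random

# Split sizes quoted in the table above; everything else is an assumption.
TRAIN_LENGTHS = range(1, 6)    # in-distribution lengths 1..5
TEST_LENGTHS = range(6, 31)    # out-of-distribution lengths 6..30
TRAIN_PER_LENGTH = 1_000
TEST_PER_LENGTH = 100


def random_operand(num_digits, rng):
    """Return a random non-negative integer with exactly num_digits digits."""
    low = 10 ** (num_digits - 1) if num_digits > 1 else 0
    high = 10 ** num_digits - 1
    return rng.randint(low, high)


def make_addition_sample(length, rng):
    """Build one hypothetical addition sample whose operands have `length` digits."""
    a = random_operand(length, rng)
    b = random_operand(length, rng)
    return {"length": length, "question": f"{a} + {b}", "answer": str(a + b)}


def build_split(lengths, per_length, seed):
    """Generate per_length samples for every length in lengths."""
    rng = random.Random(seed)
    return [
        make_addition_sample(n, rng)
        for n in lengths
        for _ in range(per_length)
    ]


train_set = build_split(TRAIN_LENGTHS, TRAIN_PER_LENGTH, seed=0)
test_set = build_split(TEST_LENGTHS, TEST_PER_LENGTH, seed=1)

print(len(train_set))  # 5 lengths x 1,000 samples = 5000
print(len(test_set))   # 25 lengths x 100 samples = 2500
```

Under these assumptions the training set contains the 5k samples quoted above, and no test length overlaps a training length, which is what makes the evaluation a length-generalization test rather than an in-distribution one.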