Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Understanding Data Influence in Reinforcement Finetuning
Authors: Haoru Tan, Xiuzhe Wu, Sitong Wu, Shaofeng Zhang, Yanfeng Chen, Xingwu Sun, Jeanne Shen, Xiaojuan Qi
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that RFT-Inf consistently improves reward performance and accelerates convergence in reinforcement fine-tuning. We validate RFT-Inf across multiple benchmarks, demonstrating its effectiveness in identifying high-impact samples and improving final model performance. Using RFT-Inf for data selection yields significant performance gains over baselines that rely on full datasets or heuristic selection strategies. Notably, in mathematical reasoning tasks, we only require about 20% data selected by our data influence estimator to achieve more stable training and superior results compared to using the entire dataset. Compared to various heuristic or rule-based data selection methods, our approach significantly outperforms them in both performance and generalization. 5 Experiments We conducted comprehensive experiments to evaluate the effectiveness of our proposed method. Sec. 5.1 is the main experiment. Then, Sec. 5.2 provides detailed ablation studies to analyze the impact of key components in our approach. |
| Researcher Affiliation | Collaboration | Haoru Tan1 Xiuzhe Wu5 Sitong Wu3 Shaofeng Zhang4 Yanfeng Chen2 Xingwu Sun2 Jeanne Shen5 Xiaojuan Qi1 1The University of Hong Kong 2Hunyuan Team, Tencent 3The Chinese University of Hong Kong 4University of Science and Technology of China 5Stanford University |
| Pseudocode | Yes | Algorithm 1: Data Selection Pipeline. Require: A dataset $Z = \{(s_i, y_i)\}_{i=1}^{N}$ and selection budget $\delta$; a large language model $\pi_\theta$ and a reinforcement fine-tuning algorithm RFT. 1: Train the model $\pi_\theta$ for $E$ epochs on $Z$ using RFT and save checkpoints: $\{\theta_1, \ldots, \theta_E\} \leftarrow \mathrm{RFT}(\pi_\theta, Z, E)$ 2: for each sample $z_i = (s_i, y_i) \in Z$ do 3: Calculate the data influence estimator $\hat{D}(z_i)$ according to Eq. (5) 4: end for 5: Select the top-$\delta$ samples based on their data influence estimators to form the new subset $Z_{\text{new}}$ 6: return The subset model $Z_{\text{new}}$ |
| Open Source Code | No | The code and data will be made publicly available upon acceptance through peer review. |
| Open Datasets | Yes | We utilized the dataset released by DeepScaleR [14], which is a comprehensive mathematical dataset compiled from multiple sources, with duplicates removed and data cleaned. This dataset includes AIME problems from 1984 to 2023 and AMC problems before 2023, along with questions from the Omni-MATH [61] and STILL [62] datasets, featuring problems from various national and international mathematics competitions. This training dataset contains approximately 40,000 math problem-answer pairs. To evaluate the reasoning abilities of the models, we utilize five different mathematics benchmarks: AIME24 [20], MATH-500 [63], AMC23 [64], Minerva [65], and OlympiadBench [66]. |
| Dataset Splits | No | The paper describes using a "full training dataset" (40,000 math problem-answer pairs) for surrogate training and then selecting the "top-δ samples" (e.g., 20% selection ratio) to form a new subset for formal reinforcement fine-tuning. Evaluation is done on external benchmarks (AIME24, MATH-500, AMC23, Minerva, and OlympiadBench). However, the paper does not specify how the primary training dataset itself is split into distinct training, validation, and test sets to reproduce the experimental results. The benchmarks serve as independent test sets, but typical train/validation splits from the main training data are not described. |
| Hardware Specification | Yes | The experiments were conducted using the PyTorch framework on two high-performance computing servers, each equipped with eight NVIDIA H200 GPUs. |
| Software Dependencies | No | The experiments were conducted using the PyTorch framework on two high-performance computing servers... The paper mentions using the 'PyTorch framework' but does not provide a specific version number for PyTorch or any other key software dependencies used in the experiments. |
| Experiment Setup | Yes | For our approach, we made the following settings: during the surrogate training phase, we performed LoRA training with a LoRA rank set to 16 and a total of 2 training epochs. We optimized the network using the AdamW optimizer with a constant learning rate of 1e-6 and a weight decay of 0.1. |
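
The selection step of Algorithm 1 quoted in the Pseudocode row (rank samples by their influence estimate, keep the top-δ) can be illustrated with a minimal Python sketch. This is not the authors' code, which was not released at the time of classification. The function name `select_top_delta` is hypothetical, the `influence` callable is a placeholder for the estimator of Eq. (5), which is not reproduced on this page, and treating δ as a fraction of the dataset is an assumption based on the "20% selection ratio" mentioned in the Dataset Splits row.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_top_delta(
    samples: Sequence[T],
    influence: Callable[[T], float],
    delta: float,
) -> List[T]:
    """Return the top-delta fraction of samples ranked by influence score.

    `influence` is a hypothetical stand-in for the paper's data influence
    estimator (Eq. (5)); here it is any function mapping a sample to a float.
    `delta` is assumed to be a fraction in (0, 1].
    """
    if not 0.0 < delta <= 1.0:
        raise ValueError("delta must be a fraction in (0, 1]")
    if not samples:
        return []
    # Keep at least one sample; round the budget up to a whole sample count.
    k = max(1, math.ceil(delta * len(samples)))
    # sorted() is stable, so tied scores keep their original relative order.
    ranked = sorted(samples, key=influence, reverse=True)
    return ranked[:k]


# Toy usage with made-up scores (not results from the paper).
toy_scores = {"q1": 0.1, "q2": 0.9, "q3": 0.4, "q4": 0.7}
subset = select_top_delta(list(toy_scores), toy_scores.get, delta=0.5)
```

In the pipeline described above, the scores would come from the checkpoints saved during surrogate training, and the returned subset would then be used for the formal reinforcement fine-tuning run.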