Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

The Best Instruction-Tuning Data are Those That Fit

Authors: Dylan Zhang, Qirun Dai, Hao Peng

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first evaluate GRAPE with a controlled experiment, where we sample various solutions for each question in UltraInteract from multiple models and finetune on GRAPE-selected data using LMs from different families including LLaMA3.1-8B, Mistral-7B and Qwen2.5-7B. GRAPE significantly outperforms strong baselines, including distilling from the strongest model with absolute gain up to 13.8% averaging across benchmarks, and a baseline trained on 3× more data with maximum 17.3% performance improvements.
Researcher Affiliation | Academia | Dylan Zhang, University of Illinois Urbana-Champaign, EMAIL; Qirun Dai, University of Chicago, EMAIL; Hao Peng, University of Illinois Urbana-Champaign, EMAIL
Pseudocode | No | As diagrammed in Figure 3, GRAPE consists of two main steps, followed by standard SFT: Response Collection (§3.1): Collect a pool of high-quality candidate responses from various sources. Customization (§3.2): For the target model to be finetuned π_{θ_0}, find the response(s), for each instruction, that are closest to the pretrained distribution of π_{θ_0}.
Open Source Code | No | Justification: We use publicly available datasets and models. Our method only requires computing the normalized probability of training data, which can be easily done with any open-sourced machine learning codebase.
Open Datasets | Yes | We use UltraInteract-SFT (Yuan et al., 2024b), which contains approximately 80,800 unique instructions... We evaluate on a set of commonly used benchmarks spanning over coding, math, knowledge and instruction-following. We evaluated on LeetCode (Guo et al., 2024a), MATH (Hendrycks et al., 2021b), BigBench Hard (BBH) (Suzgun et al., 2022), MMLU (Hendrycks et al., 2021a), and AlpacaEval-V2 (Dubois et al., 2024). Justification: Yes, we cite all the datasets and pretrained models used in the paper, which are all open-sourced for research use.
Dataset Splits | No | We use UltraInteract-SFT (Yuan et al., 2024b), which contains approximately 80,800 unique instructions... We ensure that all training configurations (GRAPE, baselines) use the same number of instructions and responses per instruction as original UltraInteract-SFT, unless otherwise stated. For our experiments, we train small reference models corresponding to the final target models... trained on a random 5% subset of the dataset over four epochs.
Hardware Specification | Yes | We train our models on a 4-GPU Nvidia-GH200 node, with batch size 256 and micro batch size 2.
Software Dependencies | No | We then perform K-means clustering using the Faiss library to efficiently partition the trajectory space into 100 clusters.
Experiment Setup | Yes | We train all models for 1 epoch with a learning rate of 10^{-5}. We train our models on a 4-GPU Nvidia-GH200 node, with batch size 256 and micro batch size 2.
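
The Pseudocode and Open Source Code entries describe the customization step only in prose: for each instruction, keep the candidate response closest to the target model's pretrained distribution, scored by a normalized probability. A minimal Python sketch of that selection rule follows. It is not the authors' code. It assumes the normalization is a mean over per-token log-probabilities (the table does not specify it), and the names `normalized_logprob`, `select_responses`, and the toy scores are hypothetical; in practice the log-probabilities would come from scoring each response with the base model that is about to be finetuned.

```python
from typing import Dict, List, Sequence


def normalized_logprob(token_logprobs: Sequence[float]) -> float:
    """Mean per-token log-probability of one response.

    Assumption: length normalization by token count; the source only says
    "normalized probability".
    """
    if not token_logprobs:
        raise ValueError("response has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def select_responses(
    candidates: Dict[str, List[Sequence[float]]]
) -> Dict[str, int]:
    """For each instruction, return the index of the candidate response
    with the highest normalized log-probability under the target model."""
    selected = {}
    for instruction, responses in candidates.items():
        scores = [normalized_logprob(r) for r in responses]
        selected[instruction] = max(range(len(scores)), key=scores.__getitem__)
    return selected


# Hypothetical per-token log-probs for two instructions, two candidates each.
toy_scores = {
    "q1": [[-0.2, -0.4, -0.3], [-1.5, -0.1]],
    "q2": [[-2.0, -2.0], [-0.5, -0.7, -0.9, -0.3]],
}
picks = select_responses(toy_scores)
print(picks)
```

The selected responses would then feed a standard SFT run; the scoring pass itself needs only forward passes of the base model, which is consistent with the justification quoted in the Open Source Code row.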