Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning

Authors: Yanjun Fu, Faisal Hamman, Sanghamitra Dutta

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments across multiple instruction tuning datasets and pretrained LLMs show that our proposed method T-SHIRT outperforms existing baselines on a wide range of downstream tasks. At the same time, our method remains computationally and financially efficient. For example, T-SHIRT requires only about 40 minutes to select data from the 52k-sample Alpaca-GPT-4 [11] dataset using GPT-2 on a single GPU.
Researcher Affiliation | Academia | Yanjun Fu, Faisal Hamman, Sanghamitra Dutta; University of Maryland, College Park
Pseudocode | Yes | Algorithm 1: Token-Selective Hierarchical Data Selection for Instruction Tuning (T-SHIRT)
Open Source Code | Yes | Our code is available at https://github.com/Dynamite321/T-SHIRT.
Open Datasets | Yes | We conduct data selection on two widely used instruction-tuning datasets of different initial qualities and scales: Alpaca-GPT-4 [11] and Magpie [10]. Magpie [10], specifically the Magpie-Pro-300K-Filtered version (https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered), is a fully synthetic dataset comprising 300k high-quality instruction-response pairs.
Dataset Splits | Yes | We use various data selection methods to select 5% of the 52k samples in Alpaca-GPT-4 and approximately 3.3% (10k samples) of the 300k-sample Magpie dataset. These include six standardized benchmarks from the Open LLM Leaderboard [35, 36]: ARC-Challenge [37], HellaSwag [38], MMLU [39], TruthfulQA [40], BBH [41], and GSM8k [42]. For these benchmarks, we use the LM-Evaluation-Harness [43] and report their default evaluation metrics.
Hardware Specification | Yes | All instruction tuning experiments are conducted on a server equipped with NVIDIA A6000 GPUs.
Software Dependencies | No | The paper mentions using GPT-2 for S-IFD score computation and LM-Evaluation-Harness [43] for benchmarks, but does not specify version numbers for general software libraries or frameworks (e.g., Python, PyTorch, CUDA) required for reproducibility.
Experiment Setup | Yes | Following the same training settings as prior works [16, 17, 46], we use a learning rate of 2e-5 and train for 3 epochs. For data selected from Magpie, we instead follow the setup in Magpie [10] and train for 2 epochs. Additional training details are provided in Appendix B.5.
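
The selection budgets and training settings quoted in the Dataset Splits and Experiment Setup rows can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the T-SHIRT repository: the class and field names are hypothetical, "52k" and "300k" are assumed to mean exactly 52,000 and 300,000 samples, and the Magpie learning rate is left unset because the table does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SelectionSetup:
    """One dataset's selection budget and training settings (hypothetical names)."""
    dataset: str
    pool_size: int                   # nominal pool size quoted in the table
    selected: int                    # samples kept after data selection
    learning_rate: Optional[float]   # None when the table does not state it
    epochs: int

    @property
    def fraction(self) -> float:
        """Share of the pool that is kept for instruction tuning."""
        return self.selected / self.pool_size


# Assumption: "52k" is treated as exactly 52_000, so 5% gives 2_600 samples.
ALPACA = SelectionSetup("Alpaca-GPT-4", 52_000, round(0.05 * 52_000), 2e-5, 3)

# Assumption: "300k" is treated as exactly 300_000; the table quotes 10k selected.
# The learning rate is not given for Magpie, only the epoch count.
MAGPIE = SelectionSetup("Magpie-Pro-300K-Filtered", 300_000, 10_000, None, 2)

for setup in (ALPACA, MAGPIE):
    print(f"{setup.dataset}: {setup.selected} of {setup.pool_size} "
          f"({setup.fraction:.1%}), lr={setup.learning_rate}, epochs={setup.epochs}")
```

Under these assumptions the Alpaca-GPT-4 budget works out to 2,600 samples, and 10,000 of 300,000 is about 3.3%, which matches the figures in the table.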