Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

T-SHIRT: Token-Selective Hierarchical Data Selection for Instruction Tuning

Authors: Yanjun Fu, Faisal Hamman, Sanghamitra Dutta

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across multiple instruction tuning datasets and pretrained LLMs show that our proposed method T-SHIRT outperforms existing baselines on a wide range of downstream tasks. At the same time, our method remains computationally and financially efficient. For example, T-SHIRT requires only about 40 minutes to select data from the 52k-sample Alpaca-GPT-4 [11] dataset using GPT-2 on a single GPU.
Researcher Affiliation	Academia	Yanjun Fu, Faisal Hamman, Sanghamitra Dutta; University of Maryland, College Park
Pseudocode	Yes	Algorithm 1: Token-Selective Hierarchical Data Selection for Instruction Tuning (T-SHIRT)
Open Source Code	Yes	Our code is available at https://github.com/Dynamite321/T-SHIRT.
Open Datasets	Yes	We conduct data selection on two widely used instruction-tuning datasets of different initial qualities and scales: Alpaca-GPT-4 [11] and Magpie [10]. Magpie [10], specifically the Magpie-Pro-300K-Filtered version (https://huggingface.co/datasets/Magpie-Align/Magpie-Pro-300K-Filtered), is a fully synthetic dataset comprising 300k high-quality instruction-response pairs.
Dataset Splits	Yes	We use various data selection methods to select 5% of the 52k samples in Alpaca-GPT-4 and approximately 3.3% (10k samples) of the 300k-sample Magpie dataset. These include six standardized benchmarks from the Open LLM Leaderboard [35, 36]: ARC-Challenge [37], HellaSwag [38], MMLU [39], TruthfulQA [40], BBH [41], and GSM8k [42]. For these benchmarks, we use the LM-Evaluation-Harness [43] and report their default evaluation metrics.
Hardware Specification	Yes	All instruction tuning experiments are conducted on a server equipped with NVIDIA A6000 GPUs.
Software Dependencies	No	The paper mentions using GPT-2 for S-IFD score computation and LM-Evaluation-Harness [43] for benchmarks, but does not specify version numbers for general software libraries or frameworks (e.g., Python, PyTorch, CUDA) required for reproducibility.
Experiment Setup	Yes	Following the same training settings as prior works [16, 17, 46], we use a learning rate of 2e-5 and train for 3 epochs. For data selected from Magpie, we instead follow the setup in Magpie [10] and train for 2 epochs. Additional training details are provided in Appendix B.5.
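
The selection budgets quoted under Dataset Splits can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the helper names are hypothetical, and the dataset sizes are the rounded figures quoted above ("52k" and "300k"), so exact counts in the actual datasets may differ slightly.

```python
def selected_count(total: int, fraction: float) -> int:
    """Number of samples kept when selecting `fraction` of `total` samples."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return round(total * fraction)


def selection_fraction(selected: int, total: int) -> float:
    """Fraction of the dataset kept when `selected` of `total` samples are chosen."""
    if total <= 0:
        raise ValueError("total must be positive")
    return selected / total


# Rounded sizes as quoted in the table (assumption: exact sizes may differ).
ALPACA_GPT4_TOTAL = 52_000
MAGPIE_TOTAL = 300_000

# 5% of the ~52k Alpaca-GPT-4 samples.
alpaca_selected = selected_count(ALPACA_GPT4_TOTAL, 0.05)

# 10k samples out of the ~300k Magpie samples.
magpie_fraction = selection_fraction(10_000, MAGPIE_TOTAL)

print(f"Alpaca-GPT-4: {alpaca_selected} samples selected")
print(f"Magpie: {magpie_fraction:.1%} of the dataset selected")
```

Under these rounded sizes, the 5% budget on Alpaca-GPT-4 corresponds to roughly 2,600 samples, and 10k Magpie samples correspond to the "approximately 3.3%" stated in the table.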