Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

STAFF: Speculative Coreset Selection for Task-Specific Fine-tuning

Authors: Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen, Tianlin Li, Weipeng Jiang, Yang Liu

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate STAFF on three LLMs and three downstream tasks and show that STAFF improves the performance of SOTA methods by up to 54.3% and reduces selection overhead by up to 70.5% at different pruning rates. Experiment results show that STAFF outperforms SOTA methods in coreset selection across different pruning rates, improving fine-tuning performance by up to 54.3% compared to the best baseline method and saving up to 70.5% of selection overhead.
Researcher Affiliation	Academia	1 Xi'an Jiaotong University; 2 University of Massachusetts, Amherst; 3 Nanyang Technological University. {EMAIL,chaoshen@xjtu,EMAIL}.edu.cn EMAIL {EMAIL,yangliu@ntu}.edu.sg
Pseudocode	Yes	Algorithm 1 STAFF for Coreset Selection
Open Source Code	Yes	Our code is publicly available at https://github.com/shiningrain/STAFF. Our implementation and data are publicly available (footnote 1: Our code is available at https://github.com/shiningrain/STAFF). To follow the Open Science Policy and support reproducibility, we have released code about our implementations and evaluations. All resources are available in https://github.com/shiningrain/STAFF.
Open Datasets	Yes	We evaluate STAFF on three datasets on different downstream tasks, namely, the BioInstruct dataset (Tran et al., 2024) (biology question-answering), DialogSum dataset (Chen et al., 2021) (dialogue summarization), and the Kazakh-English subset of WMT-19 dataset (Barrault et al., 2019) (translation of minority languages).
Dataset Splits	Yes	In the experiment, we divided each dataset into the training set and the test set according to a ratio of 9:1.
Hardware Specification	Yes	All fine-tuning experiments are conducted on one NVIDIA RTX A6000 GPU.
Software Dependencies	No	While the paper mentions software like LoRA for fine-tuning and a fine-tuning framework, it does not provide specific version numbers for these software components or any other libraries.
Experiment Setup	Yes	We set fine-tuning budget T in selection to 3 and K to 50. The number of samples used in verification for each bin (bv) is 10. For fine-tuning pre-trained models on three datasets of downstream tasks, we perform a grid search over learning rate {1e-5, 2e-5, 1e-4, 2e-4} and the batch size {2, 4, 8}. We opt for a fixed number of epochs (e.g., 4 epochs) in all experiments. Table 5 provides specific learning rates for each model on different datasets.
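
As a rough illustration of the Dataset Splits and Experiment Setup rows, the sketch below enumerates the quoted grid (four learning rates, three batch sizes, 4 epochs) and performs a 9:1 train/test split. It is not taken from the STAFF repository: the function names, the shuffling step, the seed, and the `evaluate` callback are all assumptions made for illustration, and the paper's actual selection criterion may differ.

```python
import itertools
import random

# Values quoted in the Experiment Setup row above.
LEARNING_RATES = [1e-5, 2e-5, 1e-4, 2e-4]
BATCH_SIZES = [2, 4, 8]
NUM_EPOCHS = 4


def split_train_test(samples, train_parts=9, test_parts=1, seed=0):
    """Split samples 9:1 into train and test sets.

    Shuffling and the seed are assumptions; the source only states the ratio.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


def grid_configs():
    """Return every (learning rate, batch size) combination in the grid."""
    return [
        {"learning_rate": lr, "batch_size": bs, "num_epochs": NUM_EPOCHS}
        for lr, bs in itertools.product(LEARNING_RATES, BATCH_SIZES)
    ]


def grid_search(evaluate):
    """Pick the configuration with the highest score.

    `evaluate` is a hypothetical callable that fine-tunes with the given
    configuration and returns a score where higher is better.
    """
    return max(grid_configs(), key=evaluate)
```

The grid contains 4 x 3 = 12 configurations, so a full search under these assumptions means 12 fine-tuning runs per model and dataset.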