Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
STAFF: Speculative Coreset Selection for Task-Specific Fine-tuning
Authors: Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Chao Shen, Tianlin Li, Weipeng Jiang, Yang Liu
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate STAFF on three LLMs and three downstream tasks and show that STAFF improves the performance of SOTA methods by up to 54.3% and reduces selection overhead by up to 70.5% at different pruning rates. Experiment results show that STAFF outperforms SOTA methods in coreset selection across different pruning rates, improving fine-tuning performance by up to 54.3% compared to the best baseline method and saving up to 70.5% of selection overhead. |
| Researcher Affiliation | Academia | (1) Xi'an Jiaotong University; (2) University of Massachusetts, Amherst; (3) Nanyang Technological University. {EMAIL,chaoshen@xjtu,EMAIL}.edu.cn EMAIL {EMAIL,yangliu@ntu}.edu.sg |
| Pseudocode | Yes | Algorithm 1 STAFF for Coreset Selection |
| Open Source Code | Yes | Our code is publicly available at https://github.com/shiningrain/STAFF. Our implementation and data are publicly available (footnote 1: Our code is available at https://github.com/shiningrain/STAFF). To follow the Open Science Policy and support reproducibility, we have released code about our implementations and evaluations. All resources are available in https://github.com/shiningrain/STAFF. |
| Open Datasets | Yes | We evaluate STAFF on three datasets on different downstream tasks, namely, the BioInstruct dataset (Tran et al., 2024) (biology question-answering), DialogSum dataset (Chen et al., 2021) (dialogue summarization), and the Kazakh-English subset of WMT-19 dataset (Barrault et al., 2019) (translation of minority languages). |
| Dataset Splits | Yes | In the experiment, we divided each dataset into the training set and the test set according to a ratio of 9:1. |
| Hardware Specification | Yes | All fine-tuning experiments are conducted on one NVIDIA RTX A6000 GPU. |
| Software Dependencies | No | While the paper mentions software like LoRA for fine-tuning and a fine-tuning framework, it does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | We set fine-tuning budget T in selection to 3 and K to 50. The number of samples used in verification for each bin (b_v) is 10. For fine-tuning pre-trained models on three datasets of downstream tasks, we perform a grid search over learning rate {1e-5, 2e-5, 1e-4, 2e-4} and the batch size {2, 4, 8}. We opt for a fixed number of epochs (e.g., 4 epochs) in all experiments. Table 5 provides specific learning rates for each model on different datasets. |
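
The Dataset Splits and Experiment Setup rows can be illustrated with a minimal sketch. It is not taken from the STAFF repository: the function names, the shuffling step, and the random seed are assumptions made for illustration. Only the 9:1 ratio, the learning-rate and batch-size grids, and the 4-epoch setting come from the rows above.

```python
import itertools
import random

# Values quoted in the Experiment Setup row.
LEARNING_RATES = [1e-5, 2e-5, 1e-4, 2e-4]
BATCH_SIZES = [2, 4, 8]
NUM_EPOCHS = 4


def split_train_test(samples, seed=0):
    """Split samples into train and test sets at a 9:1 ratio.

    Shuffling and the seed are assumptions; the source only states the ratio.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 9) // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def grid_configs():
    """Enumerate every (learning rate, batch size) pair in the search grid."""
    return [
        {"learning_rate": lr, "batch_size": bs, "num_epochs": NUM_EPOCHS}
        for lr, bs in itertools.product(LEARNING_RATES, BATCH_SIZES)
    ]


if __name__ == "__main__":
    train, test = split_train_test(range(100))
    print(len(train), len(test))
    print(len(grid_configs()), "configurations to evaluate")
```

With 4 learning rates and 3 batch sizes, the grid covers 12 configurations per model and dataset. How the best configuration is chosen is not stated in the rows above, so the sketch stops at enumeration.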