SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning

Authors: Yexiao He, Ziyao Wang, Zheyu Shen, Guoheng Sun, Yucong Dai, Yongkai Wu, Hongyi Wang, Ang Li

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to evaluate the datasets curated by SHED. The results demonstrate SHED's superiority over state-of-the-art methods across various tasks and LLMs; notably, datasets comprising only 10% of the original data selected by SHED achieve performance comparable to or surpassing that of the full datasets.
Researcher Affiliation | Academia | Yexiao He (1), Ziyao Wang (1), Zheyu Shen (1), Guoheng Sun (1), Yucong Dai (2), Yongkai Wu (2), Hongyi Wang (3), Ang Li (1). 1: University of Maryland; 2: Clemson University; 3: Rutgers University. {yexiaohe,ziyaow,zyshen,ghsun,angliece}@umd.edu; {yucongd,yongkaw}@clemson.edu; hongyi.wang.001@rutgers.edu
Pseudocode | No | The paper describes the workflow of SHED in text and with a diagram (Figure 2) but does not provide formal pseudocode or an algorithm block.
Open Source Code | Yes | Code associated with the collection of high-quality datasets curated by SHED can be found at SHED: Shapley-Based Automated Dataset Refinement.
Open Datasets | Yes | We conduct experiments on two famous benchmark datasets, MMLU (99.8k instances) [54] and WizardLM-evol-instruct-70k (70k instances) [55].
Dataset Splits | No | The paper uses the MMLU and WizardLM datasets. It mentions using '10% instances in the MMLU test set calculating the Shapley values of proxy data' during the SHED implementation, but this serves the proxy-data evaluation rather than acting as a standard validation split. No explicit training/validation/test splits for the fine-tuning process are provided.
Hardware Specification | Yes | All the experiments are conducted on two A100 GPUs, each with 80GB of memory.
Software Dependencies | No | The paper mentions software components like LLaMA-7B, K-means, Agglomerative Clustering, and the lm-evaluation-harness testing framework, but it does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We use the K-means algorithm for model-agnostic clustering and set the number of clusters to 3000. For the proxy-based Shapley calculator, the value function is the accuracy of the foundation model fine-tuned on the proxy data. We use LLaMA-7B [3] as the pre-trained foundation model and 10% of the instances in the MMLU test set for calculating the Shapley values of the proxy data. The number of iterations k is set to 10, and the number of instances n removed from the proxy data at each step is set to 60. To conserve time and resources, instruction fine-tuning within the proxy-based Shapley calculator is conducted for one epoch. For optimization-aware sampling, we employ the QOCS and QWCS strategies with the scaling factor set to 1, investigating their efficacy across a variety of target sampling sizes. These implementations are denoted SHED-QOCS and SHED-QWCS. The target sampling size varies from 1,000 to 20,000 in increments of 1,000 to thoroughly assess the impact of each sampling approach on fine-tuning performance. Appendix A: the number of training epochs was set to 3, the batch size was 128, the LoRA rank (lora_r) was 128, and the LoRA alpha (lora_alpha) was 256. For clustering, when the number of clusters (C) was 3000, the number of samples removed per group (n) was 60; when testing the impact of different C values on performance, n = C / 50. The number of iterations for Shapley value calculation (k) was 10, and the learning rate was 3e-4. (Illustrative sketches of these pipeline stages follow the table.)
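
The Experiment Setup row describes a four-stage pipeline: cluster the full dataset, score one proxy instance per cluster with Shapley values, sample clusters by quality, and fine-tune with LoRA. The sketches below illustrate each stage under stated assumptions; they are reconstructions, not the authors' released implementation.

First, the model-agnostic clustering stage (K-means, C = 3000). The paper fixes only the algorithm and cluster count, with one proxy instance per cluster; the sentence-embedding model used here is an assumption:

```python
# Sketch of the model-agnostic clustering stage. The embedding model
# ("all-MiniLM-L6-v2") is an assumption; the paper specifies only
# K-means with 3000 clusters and one proxy instance per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def cluster_and_pick_proxies(texts, n_clusters=3000):
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedder
    X = embedder.encode(texts)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    proxies = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # proxy = the member closest to the cluster centroid
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        proxies.append(int(members[dists.argmin()]))
    return proxies, km.labels_
```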
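Next, a minimal sketch of the proxy-based Shapley calculator, assuming a permutation-based Monte Carlo approximation with iterative pruning; the paper does not spell out the estimator, so treat this as schematic. Here `value_fn` stands in for one-epoch LoRA fine-tuning of LLaMA-7B followed by accuracy evaluation on 10% of the MMLU test set, with k = 10 and n = 60 as in the paper:

```python
import random

def proxy_shapley(proxy_data, value_fn, k=10, n=60):
    """Permutation-based Shapley approximation with iterative pruning:
    accumulate marginal contributions along a random permutation, then
    drop the n lowest-scoring proxy instances, for k iterations."""
    data = list(enumerate(proxy_data))          # keep original indices
    scores = {i: 0.0 for i, _ in data}
    for _ in range(k):
        random.shuffle(data)
        prev = value_fn([])                     # accuracy of the base model
        for j in range(len(data)):
            subset = [x for _, x in data[: j + 1]]
            acc = value_fn(subset)              # 1-epoch fine-tune + MMLU eval
            scores[data[j][0]] += acc - prev    # marginal contribution
            prev = acc
        data.sort(key=lambda pair: scores[pair[0]])
        data = data[n:]                         # remove the n weakest
    return scores, [i for i, _ in data]
```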
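For optimization-aware sampling, the paper names two strategies, QOCS and QWCS, with a scaling factor of 1, but does not detail them in this section. A plausible reading, labeled hypothetical: QOCS fills the budget from clusters in descending proxy quality, while QWCS draws clusters with probability proportional to quality raised to the scaling factor:

```python
# Hypothetical readings of the QOCS and QWCS samplers; the exact
# semantics are not specified in the excerpted setup text.
import random

def qocs_sample(clusters, quality, target_size):
    # fill the budget from clusters in descending proxy quality
    sampled = []
    for c in sorted(clusters, key=quality.get, reverse=True):
        sampled.extend(clusters[c][: target_size - len(sampled)])
        if len(sampled) >= target_size:
            break
    return sampled

def qwcs_sample(clusters, quality, target_size, scale=1.0):
    # draw from clusters with probability proportional to
    # (shifted quality) ** scale; scale = 1 in the paper's experiments
    q_min = min(quality.values())
    ids = [c for c in clusters if clusters[c]]
    weights = [(quality[c] - q_min + 1e-8) ** scale for c in ids]
    pool = {c: list(clusters[c]) for c in ids}  # copy; pop as we sample
    sampled = []
    while len(sampled) < target_size and ids:
        c = random.choices(ids, weights=weights, k=1)[0]
        sampled.append(pool[c].pop())
        if not pool[c]:                         # drop exhausted clusters
            j = ids.index(c)
            ids.pop(j)
            weights.pop(j)
    return sampled
```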
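Finally, the Appendix A hyperparameters (3 epochs, batch size 128, lora_r = 128, lora_alpha = 256, learning rate 3e-4) translate directly into a transformers + peft configuration. The checkpoint id, target modules, dropout, and the per-device/accumulation split of the 128 effective batch are assumptions:

```python
# Sketch of the fine-tuning configuration from Appendix A. Values marked
# "assumed" are not stated in the paper.
from transformers import AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed id

lora_config = LoraConfig(
    r=128,                                # LoRA rank (lora_r), Appendix A
    lora_alpha=256,                       # LoRA alpha, Appendix A
    target_modules=["q_proj", "v_proj"],  # assumed; not stated in the paper
    lora_dropout=0.05,                    # assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="shed-finetune",
    num_train_epochs=3,                   # Appendix A
    learning_rate=3e-4,                   # Appendix A
    per_device_train_batch_size=8,        # assumed split: 8 per device
    gradient_accumulation_steps=8,        # x 8 accumulation x 2 GPUs = 128
)
```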