SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning
Authors: Yexiao He, Ziyao Wang, Zheyu Shen, Guoheng Sun, Yucong Dai, Yongkai Wu, Hongyi Wang, Ang Li
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to evaluate the datasets curated by SHED. The results demonstrate SHED's superiority over state-of-the-art methods across various tasks and LLMs; notably, datasets comprising only 10% of the original data selected by SHED achieve performance comparable to or surpassing that of the full datasets. |
| Researcher Affiliation | Academia | Yexiao He¹, Ziyao Wang¹, Zheyu Shen¹, Guoheng Sun¹, Yucong Dai², Yongkai Wu², Hongyi Wang³, Ang Li¹; ¹University of Maryland, ²Clemson University, ³Rutgers University. Emails: {yexiaohe,ziyaow,zyshen,ghsun,angliece}@umd.edu, {yucongd,yongkaw}@clemson.edu, hongyi.wang.001@rutgers.edu |
| Pseudocode | No | The paper describes the workflow of SHED in text and with a diagram (Figure 2) but does not provide a formal pseudocode or algorithm block. |
| Open Source Code | Yes | Code associated with the collection of high-quality datasets curated by SHED can be found at SHED: Shapley-Based Automated Dataset Refinement. |
| Open Datasets | Yes | We conduct experiments on two famous benchmark datasets, MMLU (99.8k instances) [54] and WizardLM-evol-instruct-70k (70k instances) [55]. |
| Dataset Splits | No | The paper uses the MMLU and WizardLM datasets. It mentions using '10% of the instances in the MMLU test set for calculating the Shapley values of proxy data' during the SHED implementation, but this serves proxy data evaluation, not a standard validation split for fine-tuning. No explicit training/validation/test splits for the fine-tuning process are provided. |
| Hardware Specification | Yes | All the experiments are conducted on two A100 GPUs, each with 80GB of memory. |
| Software Dependencies | No | The paper mentions software components like LLaMA-7B, K-means, Agglomerative Clustering, and the lm-evaluation-harness testing framework, but it does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We use the K-means algorithm for the model-agnostic clustering and set the number of clusters to 3000. For the proxy-based Shapley calculator, the value function is set as the accuracy of the foundation model fine-tuned on the proxy data. We use LLaMA-7B [3] as the pre-trained foundation model and 10% of the instances in the MMLU test set for calculating the Shapley values of proxy data. The number of iterations k is set to 10, and the number of instances n removed from the proxy data at each step is set to 60. To conserve time and resources, instruction fine-tuning within the proxy-based Shapley calculator is conducted for one epoch. For optimization-aware sampling, we employ the QOCS and QWCS strategies, setting the scaling factor to 1 and investigating their efficacy with a variety of target sampling sizes. These implementations are denoted as SHED-QOCS and SHED-QWCS. The target sampling size varies from 1,000 to 20,000 in increments of 1,000, to thoroughly assess the impact of each sampling approach on fine-tuning performance. Appendix A: the number of training epochs was 3, the batch size was 128, the LoRA rank (lora_r) was 128, and the LoRA alpha (lora_alpha) was 256. For clustering, when the number of clusters (C) was 3000, the number of samples removed per group (n) was 60; when testing the impact of different C values on performance, n = C / 50. The number of iterations for Shapley value calculation (k) was 10, and the learning rate was 3e-4. Hedged code sketches of these pipeline stages follow the table. |
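
To make the setup row concrete, here is a minimal sketch of the model-agnostic clustering stage, assuming each instruction instance has already been embedded as a vector. The function name `select_proxy_data` and the choice of the centroid-nearest instance as each cluster's proxy are illustrative assumptions; the paper states only that K-means with 3000 clusters is used.

```python
# Hypothetical sketch of SHED's model-agnostic clustering stage.
# The proxy-selection rule (nearest to centroid) is an assumption.
import numpy as np
from sklearn.cluster import KMeans

def select_proxy_data(embeddings: np.ndarray, n_clusters: int = 3000) -> np.ndarray:
    """Cluster instance embeddings and keep, for each cluster, the instance
    nearest to the centroid as that cluster's proxy. Returns proxy indices."""
    km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0)
    labels = km.fit_predict(embeddings)
    proxy_idx = np.empty(n_clusters, dtype=int)
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # Pick the member closest to the centroid in Euclidean distance.
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        proxy_idx[c] = members[np.argmin(dists)]
    return proxy_idx
```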
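The proxy-based Shapley calculator is described in the table only by its hyperparameters: k = 10 iterations, n = 60 instances removed per step, and a value function defined as held-out accuracy (on 10% of the MMLU test set) after one epoch of fine-tuning. The sketch below approximates it with a Monte Carlo permutation-sampling Shapley loop; the prefix-by-prefix fine-tuning, the helpers `fine_tune_one_epoch` and `evaluate_accuracy`, and the rule of dropping the n lowest-scoring proxies per iteration are all assumptions, not the authors' code.

```python
# Hedged sketch of a proxy-based Shapley calculator matching the stated
# hyperparameters (k=10, n=60, value = accuracy after one fine-tuning epoch).
# fine_tune_one_epoch and evaluate_accuracy are hypothetical stand-ins.
import random

def shapley_style_scores(proxies, fine_tune_one_epoch, evaluate_accuracy,
                         k=10, n=60, seed=0):
    """Monte Carlo marginal-contribution estimates over proxy instances.
    Each iteration fine-tunes on growing prefixes of a random permutation
    and credits each proxy with its marginal accuracy gain; the n
    lowest-scoring proxies are removed before the next iteration."""
    rng = random.Random(seed)
    scores = {i: 0.0 for i in range(len(proxies))}
    active = list(scores)
    for _ in range(k):
        rng.shuffle(active)
        prev_acc, subset = 0.0, []
        for i in active:
            subset.append(proxies[i])
            model = fine_tune_one_epoch(subset)
            acc = evaluate_accuracy(model)   # e.g. on 10% of the MMLU test set
            scores[i] += acc - prev_acc      # marginal contribution of proxy i
            prev_acc = acc
        # Drop the n proxies with the lowest running scores.
        active.sort(key=lambda i: scores[i])
        active = active[n:]
    return scores
```

The naive prefix loop above fine-tunes once per proxy per iteration, which is far more expensive than whatever the authors run in practice; it is included only to show where k, n, and the value function enter the computation.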
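QOCS and QWCS are named but not defined in this table. A plausible reading is quality-ordered versus quality-weighted cluster sampling over per-cluster Shapley scores; the sketch below encodes that reading, applying the scaling factor as a softmax-style temperature. Every formula here is an assumption rather than the paper's definition.

```python
# Assumed readings of the two optimization-aware sampling strategies.
# cluster_scores: NumPy array of per-cluster Shapley scores.
import numpy as np

def qocs(cluster_scores, cluster_sizes, target_size):
    """Quality-ordered: greedily fill the budget from the highest-scoring clusters."""
    order = np.argsort(cluster_scores)[::-1]
    picked, budget = [], target_size
    for c in order:
        take = min(cluster_sizes[c], budget)
        picked.append((c, take))
        budget -= take
        if budget == 0:
            break
    return picked

def qwcs(cluster_scores, target_size, scaling_factor=1.0, seed=0):
    """Quality-weighted: draw clusters with probability proportional to
    exp(scaling_factor * score); each draw then yields one sampled instance."""
    rng = np.random.default_rng(seed)
    w = np.exp(scaling_factor * (cluster_scores - cluster_scores.max()))
    p = w / w.sum()
    return rng.choice(len(cluster_scores), size=target_size, p=p)
```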
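The Appendix A hyperparameters map directly onto a Hugging Face PEFT/LoRA configuration. The checkpoint name, the target modules, and the per-device/accumulation split of the batch size of 128 are assumptions; the numeric values (3 epochs, lora_r = 128, lora_alpha = 256, learning rate 3e-4) come from the row above.

```python
# Appendix A hyperparameters expressed as a PEFT/LoRA fine-tuning config.
# Checkpoint and target_modules are assumptions; the paper does not list them.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")
lora_cfg = LoraConfig(
    r=128,                                # lora_r from Appendix A
    lora_alpha=256,                       # lora_alpha from Appendix A
    target_modules=["q_proj", "v_proj"],  # assumed; not stated in the paper
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="shed-ft",
    num_train_epochs=3,             # epochs from Appendix A
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,  # 8 x 8 x 2 A100s = effective batch size 128
    learning_rate=3e-4,             # learning rate from Appendix A
)
```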