TSDS: Data Selection for Task-Specific Model Finetuning
Authors: Zifan Liu, Amin Karbasi, Theodoros Rekatsinas
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to validate the effectiveness of our framework. We focus on natural language processing tasks where foundation models have shown great advancements. We show that our framework beats the state-of-the-art baseline [47] by 1.5 points in F1 score on average with a selection ratio of 1% on instruction tuning for two modern large language models on three tasks. |
| Researcher Affiliation | Collaboration | Zifan Liu (University of Wisconsin-Madison, Madison, WI; zliu676@wisc.edu); Amin Karbasi (Yale University, New Haven, CT; amin.karbasi@yale.edu); Theodoros Rekatsinas (Apple, Zürich, Switzerland; trekatsinas@apple.com) |
| Pseudocode | Yes | Algorithm 1: KNN-Uniform. Algorithm 2: KNN-KDE. Algorithm 3: KNN-TV. (An illustrative KNN-KDE sketch appears after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/ZifanL/TSDS. |
| Open Datasets | Yes | We use a combination of Flan V2 [31], CoT [45], Dolly [8], and Open Assistant [26] as the data repository for selection, which contains 270K examples. [...] We select data for continued pretraining from a data repository consisting of 150M sequences crafted by Xie et al. [48] from The Pile [14]. |
| Dataset Splits | Yes | The dataset size is 0.5% / 1.0% / 5% of the data repository. [...] Table 3: Training, validation, test sizes and the number of classes in the datasets. Excerpt: ChemProt [25] (domain: Biomedical), train 4,169, validation 2,427, test 3,469, 13 classes. |
| Hardware Specification | Yes | We use an NVIDIA A100 Tensor Core GPU with 40 GB memory for instruction tuning. [...] The hardware for continued pretraining and supervised finetuning is an NVIDIA Tesla V100 GPU with 32 GB memory. [...] We use a machine with an Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz (40 cores) and 250 GB RAM. |
| Software Dependencies | No | We first build a coarse Faiss [23] index for the data repository D and use it to retrieve the 2000 nearest neighbors of each query example. [...] If we employ approximate nearest neighbor search techniques such as HNSW [34] for real vectors and ℓ2 distance, we have T1 = O((M + N) log N) and T2 = O(ML log(ML)). No specific version numbers are provided for these libraries. (A hedged Faiss retrieval sketch appears after the table.) |
| Experiment Setup | Yes | We apply LoRA [20] for parameter-efficient instruction tuning for the experiments in Section 5.1. The hyperparameters are shown in Table 5. (Table 5 lists: maximum token length 2048, batch size 128, epochs 4, optimizer AdamW, weight decay 0.0, Adam β1 0.9, Adam β2 0.999, Adam ϵ 1e-8, warmup ratio 0.03, learning rate scheduler cosine, learning rate 2e-5, LoRA rank 128, LoRA α 512, LoRA dropout rate 0.1.) (A hedged peft/transformers configuration sketch appears after the table.) |
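
The pseudocode row names three KNN-based selection variants, but their bodies are not quoted here. The following is a minimal, hypothetical sketch of the KNN-KDE idea only: retrieve each query's nearest repository examples, then discount candidates sitting in dense (near-duplicate) regions via a Gaussian kernel density estimate. The function name, scoring rule, and defaults are illustrative assumptions, not the paper's Algorithm 2.

```python
import numpy as np

def knn_kde_select(query_emb, repo_emb, k=2000, m=1000, bandwidth=1.0):
    """Illustrative KNN + KDE selection (NOT the paper's Algorithm 2).

    Takes the union of each query's k nearest repository examples, then
    scores candidates by query proximity discounted by local density, so
    clusters of near-duplicates are not over-selected.
    """
    # Exact squared L2 distances for clarity; the paper retrieves
    # neighbors with a Faiss index instead (see the next sketch).
    d2 = ((query_emb[:, None, :] - repo_emb[None, :, :]) ** 2).sum(-1)
    knn_ids = np.unique(np.argsort(d2, axis=1)[:, :k])

    # Gaussian KDE over the candidate pool: near-duplicates -> high density.
    cand = repo_emb[knn_ids]
    cd2 = ((cand[:, None, :] - cand[None, :, :]) ** 2).sum(-1)
    density = np.exp(-cd2 / (2 * bandwidth**2)).mean(axis=1)

    proximity = 1.0 / (1.0 + d2[:, knn_ids].min(axis=0))  # closer to a query -> higher
    scores = proximity / density                          # duplicate-heavy regions penalized
    return knn_ids[np.argsort(scores)[::-1][:m]]
```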
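The dependency row quotes a retrieval pipeline: a coarse Faiss index over the repository, with 2000 neighbors fetched per query example. Below is a minimal sketch of that setup, assuming an IVF index (the quote says only "coarse"; the exact index type is not named) and placeholder dimensions and sizes.

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

d = 768            # embedding dimension (assumed; not quoted above)
n_repo = 100_000   # toy repository size
k = 2000           # neighbors per query, as in the quote

repo = np.random.rand(n_repo, d).astype("float32")  # stand-in repository embeddings
queries = np.random.rand(64, d).astype("float32")   # stand-in query-example embeddings

# One plausible reading of "coarse Faiss index": an IVF index with a
# flat L2 quantizer.
nlist = 1024
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(repo)   # learn the coarse clustering
index.add(repo)
index.nprobe = 32   # clusters probed at query time (recall/speed trade-off)

# Retrieve the 2000 nearest repository examples for each query example.
distances, neighbor_ids = index.search(queries, k)
```

An HNSW variant, matching the complexity remark in the same row, would use `faiss.IndexHNSWFlat(d, 32)` in place of the IVF index (no training step needed).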
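Table 5's hyperparameters map directly onto Hugging Face `peft` and `transformers` settings. A hedged configuration sketch follows; the base-model checkpoint, the per-device batch size / accumulation split, and the adapted modules are assumptions not specified in the quoted table.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# LoRA settings copied from Table 5. target_modules is left to peft's
# per-architecture default because the quoted table does not list it.
lora_config = LoraConfig(
    r=128,
    lora_alpha=512,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

# Placeholder checkpoint: the quote does not name the base models here.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, lora_config)

# Optimization settings copied from Table 5. Batch size 128 is read as an
# effective batch size; the 8 x 16 accumulation split is an assumption.
args = TrainingArguments(
    output_dir="tsds-lora",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # 8 * 16 = 128 effective
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```

The maximum token length of 2048 from Table 5 would be enforced at tokenization time rather than through `TrainingArguments`.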