TSDS: Data Selection for Task-Specific Model Finetuning
Authors: Zifan Liu, Amin Karbasi, Theodoros Rekatsinas
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to validate the effectiveness of our framework. We focus on natural language processing tasks where foundation models have shown great advancements. We show that our framework beats the state-of-the-art baseline [47] by 1.5 points in F1 score on average with a selection ratio of 1% on instruction tuning for two modern large language models on three tasks. |
| Researcher Affiliation | Collaboration | Zifan Liu (University of Wisconsin-Madison, Madison, WI; zliu676@wisc.edu); Amin Karbasi (Yale University, New Haven, CT; amin.karbasi@yale.edu); Theodoros Rekatsinas (Apple, Zürich, Switzerland; trekatsinas@apple.com) |
| Pseudocode | Yes | Algorithm 1: KNN-Uniform. Algorithm 2: KNN-KDE. Algorithm 3: KNN-TV. (An illustrative KNN-KDE sketch appears after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/ZifanL/TSDS. |
| Open Datasets | Yes | We use a combination of Flan V2 [31], CoT [45], Dolly [8], and Open Assistant [26] as the data repository for selection, which contains 270K examples. [...] We select data for continued pretraining from a data repository consisting of 150M sequences crafted by Xie et al. [48] from The Pile [14]. |
| Dataset Splits | Yes | The dataset size is 0.5% / 1.0% / 5% of the data repository. [...] Table 3: Training, validation, test sizes and the number of classes in the datasets. Excerpt: ChemProt [25] (domain: Biomedical), train 4,169, validation 2,427, test 3,469, 13 classes. |
| Hardware Specification | Yes | We use an NVIDIA A100 Tensor Core GPU with 40 GB memory for instruction tuning. [...] The hardware for continued pretraining and supervised finetuning is an NVIDIA Tesla V100 GPU with 32 GB memory. [...] We use a machine with an Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz (40 cores) and 250 GB RAM. |
| Software Dependencies | No | We first build a coarse Faiss [23] index for the data repository D and use it to retrieve the 2000 nearest neighbors of each query example. [...] If we employ approximate nearest neighbor search techniques such as HNSW [34] for real vectors and ℓ2 distance, we have T1 = O((M + N) log N) and T2 = O(ML log(ML)). No specific version numbers are provided for these libraries. (A hedged Faiss retrieval sketch appears after the table.) |
| Experiment Setup | Yes | We apply LoRA [20] for parameter-efficient instruction tuning for the experiments in Section 5.1. The hyperparameters are shown in Table 5. (Table 5 lists: maximum token length 2048, batch size 128, epochs 4, optimizer AdamW, weight decay 0.0, Adam β1 0.9, Adam β2 0.999, Adam ϵ 1e-8, warmup ratio 0.03, learning rate scheduler cosine, learning rate 2e-5, LoRA rank 128, LoRA α 512, LoRA dropout rate 0.1.) (A hedged peft/transformers configuration sketch appears after the table.) |
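
The pseudocode row names three KNN-based selection variants, but their bodies are not quoted here. The following is a minimal, hypothetical sketch of the KNN-KDE idea only: retrieve each query's nearest repository examples, then discount candidates sitting in dense (near-duplicate) regions via a Gaussian kernel density estimate. The function name, scoring rule, and defaults are illustrative assumptions, not the paper's Algorithm 2.

```python
import numpy as np

def knn_kde_select(query_emb, repo_emb, k=2000, m=1000, bandwidth=1.0):
    """Illustrative KNN + KDE selection (NOT the paper's Algorithm 2).

    Takes the union of each query's k nearest repository examples, then
    scores candidates by query proximity discounted by local density, so
    clusters of near-duplicates are not over-selected.
    """
    # Exact squared L2 distances for clarity; the paper retrieves
    # neighbors with a Faiss index instead (see the next sketch).
    d2 = ((query_emb[:, None, :] - repo_emb[None, :, :]) ** 2).sum(-1)
    knn_ids = np.unique(np.argsort(d2, axis=1)[:, :k])

    # Gaussian KDE over the candidate pool: near-duplicates -> high density.
    cand = repo_emb[knn_ids]
    cd2 = ((cand[:, None, :] - cand[None, :, :]) ** 2).sum(-1)
    density = np.exp(-cd2 / (2 * bandwidth**2)).mean(axis=1)

    proximity = 1.0 / (1.0 + d2[:, knn_ids].min(axis=0))  # closer to a query -> higher
    scores = proximity / density                          # duplicate-heavy regions penalized
    return knn_ids[np.argsort(scores)[::-1][:m]]
```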
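The dependency row quotes a retrieval pipeline: a coarse Faiss index over the repository, with 2000 neighbors fetched per query example. Below is a minimal sketch of that setup, assuming an IVF index (the quote says only "coarse"; the exact index type is not named) and placeholder dimensions and sizes.

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

d = 768            # embedding dimension (assumed; not quoted above)
n_repo = 100_000   # toy repository size
k = 2000           # neighbors per query, as in the quote

repo = np.random.rand(n_repo, d).astype("float32")  # stand-in repository embeddings
queries = np.random.rand(64, d).astype("float32")   # stand-in query-example embeddings

# One plausible reading of "coarse Faiss index": an IVF index with a
# flat L2 quantizer.
nlist = 1024
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(repo)   # learn the coarse clustering
index.add(repo)
index.nprobe = 32   # clusters probed at query time (recall/speed trade-off)

# Retrieve the 2000 nearest repository examples for each query example.
distances, neighbor_ids = index.search(queries, k)
```

An HNSW variant, matching the complexity remark in the same row, would use `faiss.IndexHNSWFlat(d, 32)` in place of the IVF index (no training step needed).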
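Table 5's hyperparameters map directly onto Hugging Face `peft` and `transformers` settings. A hedged configuration sketch follows; the base-model checkpoint, the per-device batch size / accumulation split, and the adapted modules are assumptions not specified in the quoted table.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# LoRA settings copied from Table 5. target_modules is left to peft's
# per-architecture default because the quoted table does not list it.
lora_config = LoraConfig(
    r=128,
    lora_alpha=512,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

# Placeholder checkpoint: the quote does not name the base models here.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, lora_config)

# Optimization settings copied from Table 5. Batch size 128 is read as an
# effective batch size; the 8 x 16 accumulation split is an assumption.
args = TrainingArguments(
    output_dir="tsds-lora",
    num_train_epochs=4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,  # 8 * 16 = 128 effective
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```

The maximum token length of 2048 from Table 5 would be enforced at tokenization time rather than through `TrainingArguments`.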