LESS: Selecting Influential Data for Targeted Instruction Tuning
Authors: Mengzhou Xia, Sadhika Malladi, Suchin Gururangan, Sanjeev Arora, Danqi Chen
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. |
| Researcher Affiliation | Academia | Princeton Language and Intelligence (PLI), Princeton University, USA; Department of Computer Science, University of Washington, USA. |
| Pseudocode | Yes | Figure 1: Illustration of LESS. In step 1, we train a selection model MS with LoRA for a warmup period with a small subset of data Dwarmup ⊂ D. In step 2, we compute the Adam LoRA gradient features Γ ∈ ℝ^{|D|×P} for each candidate datapoint and save them in a gradient datastore. In step 3, for any task with few-shot examples Dval (comprising m subtasks), we compute the gradient features for each validation subtask and select the subset Dtrain with the top 5% training examples ranked by InfAdam. Step 4 is the final training stage with the selected data on a target model MT, which can be trained with either LoRA or full finetuning. Steps 1 and 2 are offline and only need to be computed once per candidate training set D. |
| Open Source Code | Yes | To facilitate future work, we release code and data at princetonnlp/LESS. |
| Open Datasets | Yes | We follow (Wang et al., 2023b) and use the following instruction tuning datasets: (1) datasets created from existing ones such as FLAN V2 (Longpre et al., 2023) and COT (Wei et al., 2022c); (2) open-ended generation datasets with human-written answers including DOLLY (Conover et al., 2023) and OPEN ASSISTANT 1 (Köpf et al., 2023). |
| Dataset Splits | Yes | Each dataset includes multiple subtasks, and each subtask comes with few-shot examples. These examples are used as Dval for data selection (§4.2) and as few-shot in-context learning demonstrations in evaluation. [...] We evaluate on the validation set Dval (the same reference set used for data selection) at the end of each epoch and select the best checkpoint to evaluate on the final test set for each experiment. |
| Hardware Specification | Yes | The reported wall-clock time is measured in single A100 (80GB) GPU hours. |
| Software Dependencies | No | The paper mentions LoRA (Hu et al., 2021) and the Adam optimizer (Kingma & Ba, 2015), but it does not specify version numbers for any programming languages or software libraries used in the implementation. |
| Experiment Setup | Yes | We employed a learning rate scheduler with linear warm-up and cosine decay, reaching a peak learning rate of 2 × 10⁻⁵. A batch size of 128 was used, and training was carried out for 4 epochs across all selected datasets. [...] For the LoRA module, we specified a rank of 128, an α value of 512, a dropout rate of 0.1, and learned LoRA matrices for all attention matrices. |
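The selection step quoted above (step 3 of Figure 1) can be sketched in a few lines: score each candidate by its similarity to the per-subtask validation gradient features, take the maximum over subtasks, and keep the top 5%. This is a minimal sketch, not the released implementation — it assumes the Adam LoRA gradient features have already been computed and stored as dense arrays (`train_grads`, `val_subtask_grads` are hypothetical names), and it uses plain cosine similarity as the influence proxy.

```python
import numpy as np

def select_top_fraction(train_grads, val_subtask_grads, fraction=0.05):
    """Rank candidates by their max cosine similarity to any validation
    subtask's mean gradient feature; return indices of the top fraction.

    train_grads:       (|D|, P) array of per-example gradient features.
    val_subtask_grads: list of length-P mean gradient features, one per subtask.
    """
    # Normalize candidate gradient features row-wise.
    train_norm = train_grads / np.linalg.norm(train_grads, axis=1, keepdims=True)
    scores = np.full(train_grads.shape[0], -np.inf)
    for g_val in val_subtask_grads:
        g_val = g_val / np.linalg.norm(g_val)
        # Influence proxy: cosine similarity to this subtask's feature;
        # keep the best score across subtasks for each candidate.
        scores = np.maximum(scores, train_norm @ g_val)
    k = max(1, int(fraction * len(scores)))
    # Indices of the top-k most influential candidates, best first.
    return np.argsort(scores)[::-1][:k], scores
```

Because the gradient datastore (steps 1–2) is built once per candidate set D, this ranking can be rerun cheaply for any new target task's few-shot examples.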
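The quoted schedule (linear warm-up then cosine decay to a peak of 2 × 10⁻⁵) can be reproduced with a small helper. The warm-up fraction below is an assumption — the excerpt does not quote one — and the decay-to-zero endpoint is likewise a common default rather than a stated detail of the paper.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.03):
    """Linear warm-up to peak_lr, then cosine decay toward zero.

    warmup_frac is an assumed value; the paper's excerpt does not specify it.
    """
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```

In practice one would pass this as a `LambdaLR`-style multiplier to the optimizer rather than setting rates by hand; the closed form above just makes the shape of the schedule explicit.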