Instruction Tuning With Loss Over Instructions
Authors: Zhengxiang Shi, Adam Yang, Bin Wu, Laurence Aitchison, Emine Yilmaz, Aldo Lipani
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through experiments across 21 diverse benchmarks, we show that, in many scenarios, IM can effectively improve the LM performance on both NLP tasks (e.g., MMLU, TruthfulQA, and HumanEval) and open-ended generation benchmarks (e.g., MT-Bench and AlpacaEval). (A sketch of the IM loss follows this table.) |
| Researcher Affiliation | Academia | Zhengxiang Shi¹, Adam X. Yang², Bin Wu¹, Laurence Aitchison², Emine Yilmaz¹, Aldo Lipani¹. ¹University College London, ²University of Bristol. {zhengxiang.shi.19,bin.wu.23,emine.yilmaz,aldo.lipani}@ucl.ac.uk; {adam.yang,laurence.aitchison}@bristol.ac.uk |
| Pseudocode | No | The paper describes mathematical loss functions but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/ZhengxiangShi/InstructionModelling. |
| Open Datasets | Yes | Instruction Tuning Datasets. We assess our method, IM, across various instruction tuning datasets, detailed as follows: (1) Stanford Alpaca [61] (52,002 examples); (2) Dolly [18] (15,011 examples); (3) ShareGPT [13] (50,000 examples); (4) Code Alpaca [10] (20,022 examples); (5) Science Literature [30] (7,544 examples); (6) WizardLM [69] (30,000 examples); (7) Tulu V2 [30] (326,181 examples). Additionally, we incorporate instruction tuning datasets under the low-resource setting or SAH: (8) LIMA [78] (1,030 examples); (9) Less [68], where high-quality instruction tuning data are selected from Flan V2 and Dolly. Here, we use the Less MMLU Chat (13,533 examples), Less BBH ICL (13,533 examples), and Less Tydiqa (13,533 examples); (10) Alpagasus [11], which offers three subsets: Alpagasus Dolly 3k (2,996 examples), Alpagasus Dolly 9k (9,229 examples) selected from Dolly, and Alpagasus Alpaca 5k (5,305 examples) selected from Stanford Alpaca. |
| Dataset Splits | No | The paper mentions using specific instruction tuning datasets as training examples and then evaluating on separate NLP benchmarks and open-ended generation benchmarks as test sets. However, it does not explicitly state the use of a distinct validation set split from the training data for hyperparameter tuning or early stopping during their model training. |
| Hardware Specification | Yes | In our study, we fine-tune the LLaMA-2-7B, LLaMA-2-13B and OPT-6.7B models using four A100 80G GPUs, with a per-GPU batch size of 1 and a total batch size of 128, employing a learning rate of 2e-5. |
| Software Dependencies | No | The paper states: "Our code is implemented using Open-Instruct, PyTorch and Huggingface." However, it does not give version numbers for these libraries, which would be needed to reproduce the software environment. |
| Experiment Setup | Yes | Table 6: Hyperparameters and configurations for supervised fine-tuning. GPUs: 2 or 4 A100 80G GPUs, or 2 A6000 48G GPUs; Batch size per GPU: 1; Total batch size: 128; Number of epochs: 2, 3, or 10; Maximum sequence length: 2048; Learning rate: 2e-5; Optimizer: AdamW; Adam epsilon: 1e-6; Adam betas: 0.9, 0.98; Learning rate scheduler: linear with warmup; Warmup proportion: 0.03; Weight decay: 0; Mixed precision: bf16; Gradient accumulation steps: calculated dynamically. (A hedged configuration sketch follows this table.) |
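
The Research Type row above refers to Instruction Modelling (IM), the paper's core idea: apply the language-modelling loss to the instruction (prompt) tokens as well as the output tokens, whereas standard instruction tuning masks the prompt. Below is a minimal PyTorch sketch of that difference, assuming a single concatenated prompt+response sequence; the helper names and the `prompt_len` bookkeeping are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value ignored by cross_entropy


def build_labels(input_ids: torch.Tensor, prompt_len: int, loss_over_instructions: bool) -> torch.Tensor:
    """Build next-token-prediction labels for one concatenated (prompt + response) sequence.

    Standard instruction tuning masks the prompt tokens; Instruction Modelling (IM)
    keeps them, so the loss is also computed over the instruction.
    """
    labels = input_ids.clone()
    if not loss_over_instructions:
        labels[:prompt_len] = IGNORE_INDEX  # mask instruction tokens (baseline behaviour)
    return labels


def lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Causal LM loss: predict token t+1 from tokens up to t, skipping ignored labels."""
    shift_logits = logits[:-1, :]  # drop the final position (nothing to predict after it)
    shift_labels = labels[1:]      # drop the first token (it has no preceding context)
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=IGNORE_INDEX)


if __name__ == "__main__":
    # Toy example: 4 prompt tokens followed by 3 response tokens, vocabulary of 10.
    input_ids = torch.tensor([5, 2, 7, 1, 3, 8, 4])
    logits = torch.randn(7, 10)  # stand-in for model outputs

    baseline_labels = build_labels(input_ids, prompt_len=4, loss_over_instructions=False)
    im_labels = build_labels(input_ids, prompt_len=4, loss_over_instructions=True)

    print("baseline loss:", lm_loss(logits, baseline_labels).item())
    print("IM loss:      ", lm_loss(logits, im_labels).item())
```

The only difference between the two settings is whether the prompt positions carry real labels or `IGNORE_INDEX`; everything else in the training loop is unchanged.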
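
For the supervised fine-tuning setup in Table 6, here is a hedged mapping onto Hugging Face `transformers.TrainingArguments`. The paper reports training through Open-Instruct, so this is an illustration of the listed hyperparameters rather than the authors' launch script; `output_dir` is hypothetical, and the maximum sequence length of 2048 would be enforced at tokenization time rather than through these arguments.

```python
from transformers import TrainingArguments

# Illustrative values taken from Table 6: 4 GPUs, per-GPU batch size 1, and a
# total batch size of 128 imply 32 gradient-accumulation steps ("calculated dynamically").
NUM_GPUS = 4
PER_DEVICE_BATCH_SIZE = 1
TOTAL_BATCH_SIZE = 128
GRAD_ACCUM_STEPS = TOTAL_BATCH_SIZE // (NUM_GPUS * PER_DEVICE_BATCH_SIZE)

training_args = TrainingArguments(
    output_dir="outputs/im-sft",        # hypothetical path
    per_device_train_batch_size=PER_DEVICE_BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM_STEPS,
    num_train_epochs=2,                 # paper uses 2, 3, or 10 depending on the dataset
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    weight_decay=0.0,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    bf16=True,                          # mixed precision
)
```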