NEFTune: Noisy Embeddings Improve Instruction Finetuning
Authors: Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. |
| Researcher Affiliation | Collaboration | University of Maryland, Lawrence Livermore National Laboratory, New York University |
| Pseudocode | Yes | Algorithm 1 NEFTune: Noisy Embedding Instruction Finetuning (a hedged sketch of this algorithm appears below the table). |
| Open Source Code | Yes | Code is available on Github: https://github.com/neelsjain/NEFTune. |
| Open Datasets | Yes | Alpaca (Taori et al., 2023) was constructed using the Self-Instruct method of Wang et al. (2022), and the Text-Davinci-003 model (Ouyang et al., 2022). [...] Evol-Instruct (Xu et al., 2023) contains 70k single-turn instructions [...]. Open-Platypus (Lee et al., 2023) is a curated dataset amalgamated from 11 open-source datasets [...]. ShareGPT (Chiang et al., 2023) is a dataset of 70K voluntarily-shared ChatGPT conversations (ShareGPT, 2023). |
| Dataset Splits | No | The paper describes hyperparameter tuning through a 'coarse sweep on LLaMA-1 (7B) trained on the Alpaca dataset' but does not specify explicit training, validation, and test dataset splits with percentages or counts. |
| Hardware Specification | Yes | We finetune the 7B parameter models on four A5000s and 13B parameters on eight A5000s using bfloat16 precision. |
| Software Dependencies | No | The paper mentions 'bfloat16 precision' and 'open source software' but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We use a learning rate of 5e-5 and the Adam optimizer for all 7B models... We train all models for 3 epochs on all datasets, setting the same seed for each run, with an effective batch size of 128 (4 cards, batch size 4, 8 gradient accumulation steps). An illustrative configuration sketch follows the table. |
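The pseudocode referenced in the table (Algorithm 1) adds uniform noise to the embedded training inputs before the forward pass. Below is a minimal PyTorch sketch of that noise step, assuming an embedding tensor of shape (batch, sequence length, embedding dim); the function name and default `alpha` are illustrative, not taken from the authors' released code.

```python
import torch

def add_neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Perturb token embeddings as in NEFTune's Algorithm 1 (training only).

    embeddings: (batch, seq_len, dim) output of the model's embedding layer.
    alpha: noise-scale hyperparameter swept in the paper (e.g., 5, 10, 15).
    """
    _, seq_len, dim = embeddings.shape
    # Uniform noise in [-1, 1], scaled by alpha / sqrt(L * d).
    scale = alpha / (seq_len * dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0)
    return embeddings + scale * noise
```

The noise is applied only during finetuning; evaluation and generation use the unperturbed embeddings.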
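The experiment-setup row can also be read as a concrete training configuration. The sketch below expresses those values with Hugging Face `TrainingArguments`; the choice of the Hugging Face Trainer is an assumption about tooling, and the seed value is a placeholder, since the paper only states that the same seed is reused across runs.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="neftune-llama2-7b",   # hypothetical output path
    learning_rate=5e-5,               # "learning rate of 5e-5", Adam-family optimizer
    num_train_epochs=3,               # "3 epochs on all datasets"
    per_device_train_batch_size=4,    # batch size 4 per card
    gradient_accumulation_steps=8,    # 4 cards x 4 x 8 = effective batch size 128
    bf16=True,                        # bfloat16 precision on A5000 GPUs
    seed=42,                          # assumed value; the paper fixes one seed per run
)
```

Together with the hardware row, this matches the reported effective batch size of 128 for the 7B runs (4 GPUs × batch size 4 × 8 accumulation steps).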