NEFTune: Noisy Embeddings Improve Instruction Finetuning

Authors: Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings.
Researcher Affiliation | Collaboration | 1 University of Maryland, 2 Lawrence Livermore National Laboratory, 3 New York University
Pseudocode | Yes | Algorithm 1: NEFTune: Noisy Embedding Instruction Finetuning (a minimal code sketch of the noise injection appears after this table).
Open Source Code | Yes | Code is available on GitHub: https://github.com/neelsjain/NEFTune.
Open Datasets | Yes | Alpaca (Taori et al., 2023) was constructed using the Self-Instruct method of Wang et al. (2022) and the Text-Davinci-003 model (Ouyang et al., 2022). [...] Evol-Instruct (Xu et al., 2023) contains 70k single-turn instructions [...]. Open-Platypus (Lee et al., 2023) is a curated dataset amalgamated from 11 open-source datasets [...]. ShareGPT (Chiang et al., 2023) is a dataset of 70K voluntarily-shared ChatGPT conversations (ShareGPT, 2023).
Dataset Splits | No | The paper describes hyperparameter tuning through a "coarse sweep on LLaMA-1 (7B) trained on the Alpaca dataset" but does not specify explicit training, validation, and test dataset splits with percentages or counts.
Hardware Specification | Yes | We finetune the 7B parameter models on four A5000s and 13B parameter models on eight A5000s using bfloat16 precision.
Software Dependencies | No | The paper mentions "bfloat16 precision" and "open source software" but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | We use a learning rate of 5e-5 and the Adam optimizer for all 7B models... We train all models for 3 epochs on all datasets, setting the same seed for each run, with an effective batch size of 128 (4 cards, batch size 4, 8 gradient accumulation steps). (A hedged configuration sketch follows the table.)
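
The pseudocode referenced above (Algorithm 1) adds uniform noise to the input embeddings during finetuning, scaled by alpha / sqrt(L * d), where L is the sequence length and d is the embedding dimension. Below is a minimal PyTorch sketch of that noise injection, written as a forward hook on the embedding layer; the hook-based wiring, the example `neftune_alpha=5.0` value, and the Hugging Face `get_input_embeddings()` usage are illustrative assumptions rather than the authors' released implementation (see their GitHub repository for the official code).

```python
import torch

def neftune_hook(module, inputs, output, neftune_alpha=5.0):
    """Forward hook that adds NEFTune-style uniform noise to embedding outputs.

    Assumes the embedding output has shape (batch, seq_len, embed_dim).
    Noise is sampled from Uniform(-1, 1) and scaled by alpha / sqrt(L * d),
    following Algorithm 1, and is only added while the model is in training mode.
    """
    if module.training:
        seq_len, embed_dim = output.size(1), output.size(2)
        mag_norm = neftune_alpha / (seq_len * embed_dim) ** 0.5
        output = output + torch.zeros_like(output).uniform_(-mag_norm, mag_norm)
    return output

# Illustrative wiring for a Hugging Face causal LM (names are assumptions):
# embeddings = model.get_input_embeddings()
# embeddings.register_forward_hook(
#     lambda mod, inp, out: neftune_hook(mod, inp, out, neftune_alpha=5.0)
# )
```

Because the hook returns a tensor, PyTorch substitutes it for the embedding layer's output, so the rest of the forward pass and the loss computation are unchanged; at evaluation time `module.training` is False and no noise is added.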
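
The experiment-setup row reports a learning rate of 5e-5, the Adam optimizer, 3 epochs, and an effective batch size of 128 (4 cards x batch size 4 x 8 gradient accumulation steps) in bfloat16. The sketch below reconstructs that configuration with the Hugging Face `TrainingArguments` API; the paper does not state that this exact API was used, and the `output_dir`, seed value, and AdamW variant are illustrative assumptions.

```python
from transformers import TrainingArguments

# Hedged reconstruction of the reported 7B setup; treat as illustrative only.
training_args = TrainingArguments(
    output_dir="neftune-llama2-7b",    # hypothetical output path
    learning_rate=5e-5,                # reported learning rate for 7B models
    optim="adamw_torch",               # Adam-family optimizer (paper says "Adam")
    num_train_epochs=3,                # 3 epochs on all datasets
    per_device_train_batch_size=4,     # batch size 4 per card
    gradient_accumulation_steps=8,     # 8 gradient accumulation steps
    bf16=True,                         # bfloat16 precision
    seed=42,                           # "same seed for each run"; exact value not stated
)
# On 4 A5000s: 4 cards * 4 per card * 8 accumulation steps = 128 effective batch size.
```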