Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
NEFTune: Noisy Embeddings Improve Instruction Finetuning
Authors: Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. |
| Researcher Affiliation | Collaboration | 1 University of Maryland, 2 Lawrence Livermore National Laboratory, 3 New York University |
| Pseudocode | Yes | Algorithm 1 NEFTune: Noisy Embedding Instruction Finetuning |
| Open Source Code | Yes | Code is available on Github: https://github.com/neelsjain/NEFTune. |
| Open Datasets | Yes | Alpaca (Taori et al., 2023) was constructed using the Self-Instruct method of Wang et al. (2022), and the Text-Davinci-003 model (Ouyang et al., 2022). [...] Evol-Instruct (Xu et al., 2023) contains 70k single-turn instructions [...]. Open-Platypus (Lee et al., 2023) is a curated dataset amalgamated from 11 open-source datasets [...]. ShareGPT (Chiang et al., 2023) is a dataset of 70K voluntarily-shared ChatGPT conversations (ShareGPT, 2023). |
| Dataset Splits | No | The paper describes hyperparameter tuning through a 'coarse sweep on LLaMA-1 (7B) trained on the Alpaca dataset' but does not specify explicit training, validation, and test dataset splits with percentages or counts. |
| Hardware Specification | Yes | We finetune the 7B parameter models on four A5000s and 13B parameters on eight A5000s using bfloat16 precision. |
| Software Dependencies | No | The paper mentions 'bfloat16 precision' and 'open source software' but does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | We use learning rate of 5e-5 and the Adam optimizer for all 7B models... We train all models for 3 epochs on all datasets setting the same seed for each run with an effective batch size of 128 (4 cards, batch size 4, 8 gradient accumulation steps). |
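
The Experiment Setup row can be checked with a short calculation, and the idea behind Algorithm 1 can be illustrated alongside it. The sketch below is not the authors' code: the `FinetuneSetup` class and the `add_neftune_noise` function are hypothetical names introduced here, the noise rule (uniform noise in [-1, 1] scaled by alpha / sqrt(L * d)) is recalled from the NEFTune paper rather than stated in this table, and the `alpha=5.0` value and toy embedding shape are illustrative assumptions only.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneSetup:
    """Values quoted in the Experiment Setup row for the 7B models."""
    learning_rate: float = 5e-5
    epochs: int = 3
    num_gpus: int = 4               # "4 cards"
    per_device_batch_size: int = 4  # "batch size 4"
    grad_accum_steps: int = 8       # "8 gradient accumulation steps"

    @property
    def effective_batch_size(self) -> int:
        # Number of sequences contributing to each optimizer step.
        return self.num_gpus * self.per_device_batch_size * self.grad_accum_steps


def add_neftune_noise(embeddings, alpha, rng):
    """Return a noised copy of one sequence's token embeddings (L rows, d columns).

    Assumption: noise is drawn from Uniform(-1, 1) and scaled by
    alpha / sqrt(L * d), where L is sequence length and d is embedding size.
    """
    seq_len = len(embeddings)
    dim = len(embeddings[0])
    scale = alpha / math.sqrt(seq_len * dim)
    return [[x + scale * rng.uniform(-1.0, 1.0) for x in row] for row in embeddings]


setup = FinetuneSetup()

rng = random.Random(0)                 # fixed seed, mirroring "the same seed for each run"
clean = [[0.0] * 8 for _ in range(4)]  # toy embeddings: L = 4, d = 8 (illustrative)
noisy = add_neftune_noise(clean, alpha=5.0, rng=rng)  # alpha is a hypothetical choice

# With all-zero inputs, every noised entry is bounded by the scale factor.
max_shift = max(abs(v) for row in noisy for v in row)
noise_bound = 5.0 / math.sqrt(4 * 8)
```

Here `setup.effective_batch_size` reproduces the reported figure of 128 from 4 cards, batch size 4, and 8 accumulation steps. In an actual finetuning run the noise would be applied to the embedding layer's output during training only, not at inference.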