LIMA: Less Is More for Alignment

Authors: Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMA language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses... In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003... We compare LIMA to state-of-the-art language models and products across 300 challenging test prompts. In a human preference study, we find that LIMA outperforms RLHF-trained DaVinci003 from OpenAI... Ablation experiments reveal vastly diminishing returns when scaling up data quantity...
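The "standard supervised loss" mentioned above is never spelled out in code in the paper. The snippet below is a minimal illustrative sketch of a next-token cross-entropy objective on a single prompt/response pair, assuming a Hugging Face-style causal LM that exposes `.logits`; masking the prompt tokens out of the loss is our assumption for illustration, not something the paper specifies.

```python
# Hypothetical sketch of a standard supervised fine-tuning loss on one
# prompt/response pair. Assumes an HF-style causal LM with a .logits output;
# masking prompt tokens from the loss is an assumption, not from the paper.
import torch
import torch.nn.functional as F

def sft_loss(model, prompt_ids: torch.Tensor, response_ids: torch.Tensor) -> torch.Tensor:
    # Concatenate prompt and response into a single sequence (batch of 1)
    input_ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0)
    logits = model(input_ids).logits  # (1, seq_len, vocab_size)

    # Next-token prediction: shift logits and labels by one position
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:].clone()

    # Assumed: ignore the loss on prompt tokens, train only on the response
    shift_labels[:, : prompt_ids.size(-1) - 1] = -100

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
```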
Researcher Affiliation | Collaboration | Meta AI; Carnegie Mellon University; University of Southern California; Tel Aviv University
Pseudocode | No | The paper describes its training protocol and methodology in text, for example in Section 3 'Training LIMA' and Section 4.1 'Experiment Setup', but it does not include any dedicated pseudocode or algorithm blocks.
Open Source Code | No | The paper references the LLaMA model and the Alpaca project (which has a GitHub link: 'https://github.com/tatsu-lab/stanford_alpaca'), but it does not provide an explicit statement or link to the source code for the LIMA methodology itself.
Open Datasets | Yes | We curate 1,000 examples that approximate real user prompts and high-quality responses. We select 750 top questions and answers from community forums, such as Stack Exchange and wikiHow... All data instances were mined from the Pushshift Reddit Dataset [Baumgartner et al., 2020]. We sample 50 training examples from Super-Natural Instructions [Wang et al., 2022b].
Dataset Splits | Yes | We also collect a test set of 300 prompts and a development set of 50. We select 200 prompts from Group A for training and 50 prompts as a held-out development set.
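For orientation, the counts quoted in the two rows above can be tallied in a few lines. This is only a bookkeeping sketch: the assumption that the Pushshift Reddit examples fall within the 750 community-forum examples is inferred from the totals, not stated in the quotes, and the names below are ours, not the authors'.

```python
# Illustrative tally of the reported training-set composition and prompt splits.
# Grouping and names are assumptions made for this sketch.
TRAIN_COMPOSITION = {
    "community_forums (Stack Exchange, wikiHow, Pushshift Reddit)": 750,
    "super_natural_instructions": 50,
    "manually_authored_group_a_prompts": 200,
}
assert sum(TRAIN_COMPOSITION.values()) == 1000  # total curated training examples

EVAL_PROMPTS = {
    "test": 300,         # held-out test prompts
    "development": 50,   # held-out development prompts
}
```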
Hardware Specification | No | The paper states it fine-tunes a 'LLaMA 65B' model and a '7B parameter LLaMA model' but does not provide specific details on the hardware (e.g., GPU model, CPU type, memory) used for these operations.
Software Dependencies | No | The paper mentions using 'AdamW' for optimization but does not provide specific software dependencies with version numbers, such as programming language versions (e.g., Python 3.x) or library versions (e.g., PyTorch 1.x, CUDA 11.x).
Experiment Setup | Yes | We fine-tune for 15 epochs using AdamW [Loshchilov and Hutter, 2017] with β1 = 0.9, β2 = 0.95, and weight decay of 0.1. Without warm-up steps, we set the initial learning rate to 1e-5, linearly decaying to 1e-6 by the end of training. The batch size is set to 32 examples (64 for smaller models), and texts longer than 2048 tokens are trimmed. One notable deviation from the norm is the use of residual dropout; we follow Ouyang et al. [2022] and apply dropout over residual connections, starting at p_d = 0.0 at the bottom layer and linearly raising the rate to p_d = 0.3 at the last layer (p_d = 0.2 for smaller models). For each prompt, we generate a single response from each baseline model using nucleus sampling [Holtzman et al., 2019] with p = 0.9 and a temperature of τ = 0.7. We apply a repetition penalty of previously generated tokens with a hyperparameter of 1.2 [Keskar et al., 2019]. We limit the maximum token length to 2048.
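As a concrete reading of these hyperparameters, the sketch below expresses them in PyTorch, with a Hugging Face-style generate() call for the decoding settings. Model and data loading are assumed, the LambdaLR schedule is one possible mechanism for the linear decay (the paper does not name one), and the residual-dropout schedule is shown only as a per-layer rate list, since wiring it into a specific LLaMA implementation is not described.

```python
# Hedged sketch of the reported fine-tuning and decoding hyperparameters.
# Assumptions (not from the paper): a PyTorch model object, an HF-style
# generate() API, and LambdaLR as the linear-decay mechanism.
import torch

NUM_EPOCHS = 15
MAX_TOKENS = 2048   # sequences longer than this are trimmed
BATCH_SIZE = 32     # 64 for the smaller models

def residual_dropout_rates(num_layers: int, p_top: float = 0.3) -> list[float]:
    # Linearly increase residual dropout from 0.0 at the bottom layer
    # to p_top at the last layer (p_top = 0.2 for smaller models).
    return [p_top * i / (num_layers - 1) for i in range(num_layers)]

def build_optimizer_and_scheduler(model: torch.nn.Module, total_steps: int):
    # AdamW with beta1 = 0.9, beta2 = 0.95, weight decay 0.1
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=1e-5, betas=(0.9, 0.95), weight_decay=0.1
    )
    # No warm-up; linear decay from 1e-5 down to 1e-6 by the end of training
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lr_lambda=lambda step: 1.0 - 0.9 * min(step, total_steps) / total_steps,
    )
    return optimizer, scheduler

def generate_baseline_response(model, tokenizer, prompt: str) -> str:
    # Nucleus sampling with p = 0.9, temperature 0.7, repetition penalty 1.2,
    # and a 2048-token cap, mirroring the reported decoding settings.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        temperature=0.7,
        repetition_penalty=1.2,
        max_length=MAX_TOKENS,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```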