LIMA: Less Is More for Alignment
Authors: Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, Omer Levy
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMA language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses... In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003... We compare LIMA to state-of-the-art language models and products across 300 challenging test prompts. In a human preference study, we find that LIMA outperforms RLHF-trained DaVinci003 from OpenAI... Ablation experiments reveal vastly diminishing returns when scaling up data quantity... |
| Researcher Affiliation | Collaboration | Meta AI; Carnegie Mellon University; University of Southern California; Tel Aviv University |
| Pseudocode | No | The paper describes its training protocol and methodology in text, for example in Section 3 'Training LIMA' and Section 4.1 'Experiment Setup', but it does not include any dedicated pseudocode or algorithm blocks. |
| Open Source Code | No | The paper references the LLaMA model and the Alpaca project (which has a GitHub link: 'https://github.com/tatsu-lab/stanford_alpaca') but it does not provide an explicit statement or link to the source code for the LIMA methodology itself. |
| Open Datasets | Yes | We curate 1,000 examples that approximate real user prompts and high-quality responses. We select 750 top questions and answers from community forums, such as Stack Exchange and wikiHow... All data instances were mined from the Pushshift Reddit Dataset [Baumgartner et al., 2020]. We sample 50 training examples from Super-Natural Instructions [Wang et al., 2022b]. |
| Dataset Splits | Yes | We also collect a test set of 300 prompts and a development set of 50. We select 200 prompts from Group A for training and 50 prompts as a held-out development set. |
| Hardware Specification | No | The paper states it fine-tunes a 'LLaMA 65B' model and a '7B parameter LLaMA model' but does not provide specific details on the hardware (e.g., GPU model, CPU type, memory) used for these operations. |
| Software Dependencies | No | The paper mentions using 'AdamW' for optimization but does not provide specific software dependencies with version numbers, such as programming language versions (e.g., Python 3.x) or library versions (e.g., PyTorch 1.x, CUDA 11.x). |
| Experiment Setup | Yes | We fine-tune for 15 epochs using AdamW [Loshchilov and Hutter, 2017] with β1 = 0.9, β2 = 0.95, and weight decay of 0.1. Without warm-up steps, we set the initial learning rate to 1e-5, linearly decaying to 1e-6 by the end of training. The batch size is set to 32 examples (64 for smaller models), and texts longer than 2048 tokens are trimmed. One notable deviation from the norm is the use of residual dropout; we follow Ouyang et al. [2022] and apply dropout over residual connections, starting at pd = 0.0 at the bottom layer and linearly raising the rate to pd = 0.3 at the last layer (pd = 0.2 for smaller models). For each prompt, we generate a single response from each baseline model using nucleus sampling [Holtzman et al., 2019] with p = 0.9 and a temperature of τ = 0.7. We apply a repetition penalty of previously generated tokens with a hyperparameter of 1.2 [Keskar et al., 2019]. We limit the maximum token length to 2048. (See the configuration sketch below.) |
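
The reported hyperparameters in the Experiment Setup row can be collected into a short configuration sketch. This is an assumption-laden illustration, not the authors' code: the placeholder model, the layer count of 80 (typical for LLaMA-65B), the approximate step count, and the Hugging Face-style generation keyword dictionary are all our additions; only the numeric settings (betas, weight decay, learning-rate schedule, per-layer residual dropout, and sampling parameters) come from the paper.

```python
# Hedged sketch of LIMA's reported fine-tuning and generation settings.
# Placeholders (model, layer count, step count) are assumptions, not from the paper.
import torch
from torch import nn

NUM_LAYERS = 80      # assumed depth for LLaMA-65B; used only for the dropout schedule
MAX_DROPOUT = 0.3    # residual dropout at the top layer (paper uses 0.2 for smaller models)

def residual_dropout_rate(layer_idx: int,
                          num_layers: int = NUM_LAYERS,
                          max_rate: float = MAX_DROPOUT) -> float:
    """Linearly scale residual dropout from 0.0 (bottom layer) to max_rate (top layer)."""
    return max_rate * layer_idx / (num_layers - 1)

# Placeholder module standing in for the fine-tuned language model.
model = nn.Linear(10, 10)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,             # initial learning rate, no warm-up
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# Roughly 15 epochs over ~1,000 examples at batch size 32 (step count is approximate).
TOTAL_STEPS = 15 * 1000 // 32

# Linear decay from 1e-5 to 1e-6 over training.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer,
    start_factor=1.0,
    end_factor=0.1,
    total_iters=TOTAL_STEPS,
)

# Baseline generation settings from Section 4.1, expressed as keyword arguments
# in the style of a typical generate() call (the dictionary itself is illustrative).
GENERATION_KWARGS = dict(
    do_sample=True,
    top_p=0.9,             # nucleus sampling
    temperature=0.7,
    repetition_penalty=1.2,
    max_length=2048,
)
```

Usage note: `residual_dropout_rate(0)` returns 0.0 and `residual_dropout_rate(NUM_LAYERS - 1)` returns 0.3, matching the paper's description of dropout linearly increasing from the bottom to the last layer.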