Self-Rewarding Language Models

Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason E Weston

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we start with a Llama 2 70B (Touvron et al., 2023) seed model fine-tuned on Open Assistant (Köpf et al., 2023), and then perform the above training scheme. We find that not only does the instruction following performance improve from Self-Rewarding LLM alignment compared to the baseline seed model, but importantly the reward modeling ability, which is no longer fixed, improves as well. (A hedged sketch of this training loop appears after the table.)
Researcher Affiliation | Collaboration | 1Meta. 2New York University.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a code repository for it.
Open Datasets | Yes | We use the human-authored examples provided in the Open Assistant dataset (Köpf et al., 2023) for instruction fine-tuning.
Dataset Splits | Yes | EFT Seed Data: We split this into train and evaluation sets, and use it to create LLM-as-a-Judge data. This is done by placing it in the input prompt format (detailed in Figure 6 in Appendix), which consists of the scoring criteria description, and the given instruction and response to be evaluated. For training targets, chain-of-thought justifications and final scores out of 5 are not directly provided, so we use the SFT baseline to generate such output evaluations for each input, and accept them into the training set if the ranking of their scores agrees with the human rankings in the dataset. We resample the training set by discarding some of the data that receives the most common score so that the scores are not too skewed, as we observe many samples receive a score of 4. This results in 1,630 train and 541 evaluation examples (which do not overlap with the IFT data). (A sketch of this filtering and resampling step appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or memory specifications used for running experiments.
Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., 'PyTorch 1.9', 'CUDA 11.1') for replicating the experiments.
Experiment Setup | Yes | For SFT we use learning rate 5.5e-6 which decays (cosine) to 1.1e-6 at the end of training, batch size 16 and dropout 0.1. We only calculate the loss on target tokens instead of the full sequence. For DPO we use learning rate 1e-6 which decays to 1e-7, batch size 16, dropout 0.1, and a β value of 0.1. (These values are collected in the configuration sketch after the table.)
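
Since the paper describes its training scheme only in prose, the following is a minimal sketch of one Self-Rewarding iteration as the paper describes it: sample several candidate responses per prompt, score them with the model's own LLM-as-a-Judge prompt, form (chosen, rejected) pairs from the highest- and lowest-scored candidates, and train the next model with DPO. The callables `generate`, `judge`, and `dpo_train` are hypothetical placeholders, not code from the authors.

```python
def self_rewarding_iteration(generate, judge, dpo_train, prompts, n_candidates=4):
    """One Self-Rewarding iteration M_t -> M_{t+1} (hypothetical callables):
    generate(prompt) -> str           # sample one response from the current model
    judge(prompt, response) -> float  # model-as-judge score (0-5 scale in the paper)
    dpo_train(pairs) -> next_model    # DPO training on (prompt, chosen, rejected) pairs
    """
    preference_pairs = []
    for prompt in prompts:
        # 1) Self-instruction creation: sample several candidate responses.
        candidates = [generate(prompt) for _ in range(n_candidates)]
        # 2) Self-judging: score each candidate with the model's own
        #    LLM-as-a-Judge prompt.
        scores = [judge(prompt, c) for c in candidates]
        chosen = candidates[scores.index(max(scores))]
        rejected = candidates[scores.index(min(scores))]
        # 3) Keep a preference pair only when the scores actually differ.
        if max(scores) > min(scores):
            preference_pairs.append((prompt, chosen, rejected))
    # 4) Train the next model with DPO on the self-labelled pairs.
    return dpo_train(preference_pairs)
```

In the paper this loop is repeated for a few iterations (producing models M1, M2, M3), with each new model acting as both generator and judge in the next round.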
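
The EFT (LLM-as-a-Judge) data construction quoted in the Dataset Splits row amounts to a filter-and-rebalance step. A rough sketch under stated assumptions: `evaluate` is a hypothetical callable wrapping the SFT baseline's judge generation, and `keep_modal_fraction` stands in for a downsampling rate the paper does not report.

```python
import random
from collections import Counter

def build_eft_data(evaluate, ranked_pairs, keep_modal_fraction=0.5, seed=0):
    """Build LLM-as-a-Judge (EFT) training data by filtering and rebalancing.
    `evaluate(instruction, response) -> (justification, score)` is a hypothetical
    callable; `keep_modal_fraction` is assumed, not reported in the paper.
    Each item in `ranked_pairs` has an instruction, two responses, and a human
    preference taken from Open Assistant."""
    rng = random.Random(seed)
    accepted = []
    for ex in ranked_pairs:
        eval_a = evaluate(ex["instruction"], ex["response_a"])
        eval_b = evaluate(ex["instruction"], ex["response_b"])
        # Accept the generated evaluations only if the ranking implied by their
        # scores agrees with the human ranking in the dataset.
        if eval_a[1] != eval_b[1] and (eval_a[1] > eval_b[1]) == ex["human_prefers_a"]:
            accepted.append((ex["instruction"], ex["response_a"], *eval_a))
            accepted.append((ex["instruction"], ex["response_b"], *eval_b))
    # Resample: discard some examples at the most common score (often 4,
    # per the paper) so the score distribution is not too skewed.
    modal_score = Counter(score for *_, score in accepted).most_common(1)[0][0]
    return [item for item in accepted
            if item[-1] != modal_score or rng.random() < keep_modal_fraction]
```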
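
Finally, the hyperparameters quoted in the Experiment Setup row can be collected in a small configuration, and the β value enters the standard DPO objective of Rafailov et al. (2023), which the paper builds on. The dictionary key names are assumptions of this sketch; only the numeric values come from the paper.

```python
import torch.nn.functional as F

# Quoted hyperparameters for the two training stages (key names are assumed).
SFT_CONFIG = dict(lr=5.5e-6, lr_final=1.1e-6, lr_schedule="cosine",
                  batch_size=16, dropout=0.1, loss_on="target tokens only")
DPO_CONFIG = dict(lr=1e-6, lr_final=1e-7, batch_size=16, dropout=0.1, beta=0.1)

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=DPO_CONFIG["beta"]):
    """Standard DPO objective:
    -log sigmoid(beta * ((log pi - log pi_ref)_chosen - (log pi - log pi_ref)_rejected))."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```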