Understanding the Effects of RLHF on LLM Generalisation and Diversity

Authors: Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, Roberta Raileanu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction-following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. (A minimal sketch of one such diversity measure is given after the table.)
Researcher Affiliation | Collaboration | α University College London, β Meta, γ University of Oxford
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks; it describes its procedures in text and mathematical formulas.
Open Source Code | Yes | We open source our code to enable reproducible research here: https://github.com/facebookresearch/rlfh-gen-div.
Open Datasets | Yes | Summarisation. We use the same dataset as Stiennon et al. (2022), which is a filtered version of the TL;DR dataset (Völske et al., 2017), consisting of approximately 120,000 Reddit posts with accompanying summaries. [...] We use the SFT, RLHF, and RM models released by Dubois et al. (2023, AlpacaFarm).
Dataset Splits | Yes | The dataset we use from Stiennon et al. (2022) (filtered from Nallapati et al., 2016) comes with train, validation and test splits, which we use throughout our work. [...] We then train on the ID train set, do model selection using the ID validation set, and evaluate on the ID and OOD test sets to measure the in-distribution and out-of-distribution performance.
Hardware Specification | No | The paper mentions using a "LLaMA pretrained 7 billion parameter model" and "OPT models" of various sizes, but does not specify the hardware (e.g., CPU/GPU models, memory, cloud instances) used for training or inference.
Software Dependencies | No | The paper refers to various tools and frameworks such as reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Ziegler et al., 2020), Proximal Policy Optimization (PPO; Schulman et al., 2017), GPT-4 (OpenAI, 2023), alpaca_eval (Li et al., 2023), and TextBlob (Loria, 2013). However, it does not provide specific version numbers for these software components. (A sketch of the KL-regularised PPO reward follows the table.)
Experiment Setup | Yes | For each model type (SFT, RM, RLHF) we do a sweep over learning rate, choosing ranges of values informed by choices in previous work (Stiennon et al., 2022) and early experimentation. The results in the paper are the best model with the learning rate chosen on an in-distribution validation set using loss, accuracy and reward respectively for SFT, RM and RLHF training. The learning rates for SFT are 3e-4, 1e-4, 3e-5, with 3e-5 selected; for RMs are 3e-4, 1e-4, 3e-5, 1e-5, 3e-6, with 3e-5 selected; and for RLHF are 1.5e-6, 3e-6, 6e-6, 1.5e-5, 3e-5, with 1.5e-5 selected. We list the other hyperparameters (which are unchanged between all runs) for SFT, RM and RLHF training in Table 2, Table 3 and Table 4 respectively. (A sketch of this sweep-and-selection protocol follows the table.)
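
The abstract quoted in the Research Type row reports that RLHF reduces output diversity relative to SFT "across a variety of measures". As a purely illustrative aid, here is a minimal sketch of one simple per-input diversity measure, the distinct n-gram ratio over several samples for the same prompt; the function name, whitespace tokenisation, and the choice of n are assumptions for this sketch, not the paper's exact metrics.

    from itertools import chain

    def distinct_ngrams(outputs, n=2):
        """Per-input diversity: fraction of unique n-grams across several
        sampled outputs for the same prompt. Higher means more varied samples."""
        ngrams = list(chain.from_iterable(
            zip(*(tokens[i:] for i in range(n)))
            for tokens in (out.split() for out in outputs)
        ))
        if not ngrams:
            return 0.0
        return len(set(ngrams)) / len(ngrams)

    # Toy usage: a few samples decoded from one prompt.
    samples = [
        "the cat sat on the mat",
        "a cat sat on a mat",
        "the cat is sitting on the mat",
    ]
    print(f"distinct-2: {distinct_ngrams(samples, n=2):.3f}")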
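
The Software Dependencies row notes that the RLHF stage uses Proximal Policy Optimization (PPO). A common formulation in this line of work (e.g. Stiennon et al., 2022) has PPO optimise the reward-model score minus a KL penalty towards the SFT policy. The sketch below shows only that shaped reward in isolation; the β value, tensor shapes, and placeholder inputs are illustrative assumptions rather than the paper's settings.

    import torch

    def kl_shaped_reward(rm_score, logprobs_policy, logprobs_sft, beta=0.05):
        """KL-regularised reward commonly optimised with PPO in RLHF:
            R = r_RM(x, y) - beta * sum_t [log pi(y_t | x, y_<t) - log pi_SFT(y_t | x, y_<t)]
        rm_score: scalar reward-model score per response;
        logprobs_*: per-token log-probabilities of the sampled response under the
        current policy and the frozen SFT reference model. beta is illustrative."""
        kl_per_token = logprobs_policy - logprobs_sft      # approximate per-token KL term
        return rm_score - beta * kl_per_token.sum(dim=-1)  # one shaped reward per sequence

    # Toy usage with placeholder values standing in for real model outputs.
    batch, seq_len = 4, 16
    rm_score = torch.randn(batch)
    logp_policy = -torch.rand(batch, seq_len)  # fake per-token log-probs (<= 0)
    logp_sft = -torch.rand(batch, seq_len)
    print(kl_shaped_reward(rm_score, logp_policy, logp_sft))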
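
The Dataset Splits and Experiment Setup rows describe the selection protocol: sweep the learning rate, pick the best run on the in-distribution (ID) validation set, then evaluate the selected model on the ID and OOD test sets. Below is a minimal sketch of that loop under stated assumptions; train_model, evaluate, and the toy stand-ins are hypothetical helpers, and only the SFT learning-rate grid is taken from the quote above.

    import random

    def sweep_and_select(train_model, evaluate, learning_rates):
        """Select the learning rate on the ID validation set, then report ID and
        OOD test performance for the selected model. Assumes higher is better for
        the selection metric (negate losses when selecting SFT models)."""
        best_lr, best_model, best_val = None, None, float("-inf")
        for lr in learning_rates:
            model = train_model(lr)                # train on the ID train set
            val_score = evaluate(model, "id_val")  # model selection on ID validation
            if val_score > best_val:
                best_lr, best_model, best_val = lr, model, val_score
        return {
            "lr": best_lr,
            "id_test": evaluate(best_model, "id_test"),
            "ood_test": evaluate(best_model, "ood_test"),
        }

    # Toy stand-ins for the real training/evaluation pipeline (hypothetical).
    def toy_train(lr):
        return {"lr": lr}

    def toy_evaluate(model, split):
        random.seed(hash((model["lr"], split)) % (2**32))
        return random.random()

    # SFT learning-rate grid quoted in the table above; RM and RLHF grids differ.
    print(sweep_and_select(toy_train, toy_evaluate, [3e-4, 1e-4, 3e-5]))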