WARM: On the Benefits of Weight Averaged Reward Models

Authors: Alexandre Rame, Nino Vieillard, Leonard Hussenot, Robert Dadashi-Tazehozi, Geoffrey Cideron, Olivier Bachem, Johan Ferret

ICML 2024

Reproducibility Assessment

Each entry below lists the assessed reproducibility variable, its result, and the supporting LLM response.
Research Type: Experimental
LLM Response: Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM.

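The quote above refers to best-of-N (BoN) sampling, in which N candidate generations are reranked by the reward model and the highest-scoring one is kept. A minimal sketch of BoN selection, where `generate` and `reward` are hypothetical stand-ins for the policy sampler and the (possibly weight-averaged) RM scorer:

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: samples one candidate from the policy
    reward: Callable[[str, str], float],  # hypothetical: RM score for (prompt, candidate)
    n: int = 16,
) -> str:
    """Return the highest-reward candidate among n policy samples (best-of-N)."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]
```
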
Researcher Affiliation: Industry
LLM Response: Alexandre Ramé 1 Nino Vieillard 1 Léonard Hussenot 1 Robert Dadashi 1 Geoffrey Cideron 1 Olivier Bachem 1 Johan Ferret 1 Google DeepMind. Correspondence to: Alexandre Ramé <alexandrerame@google.com>.

Pseudocode: No
LLM Response: The paper includes diagrams illustrating the WARM procedure and the Baklava strategy, but it does not contain any structured pseudocode or algorithm blocks.

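Since the paper conveys WARM through diagrams rather than pseudocode, the procedure it describes (fine-tune M reward models from diverse initializations, then average their weights) can be sketched as follows. This is a reconstruction under assumptions, not the authors' code: `train_reward_model` and `sft_checkpoints` are hypothetical placeholders, and the Baklava strategy appears only as the comment about initializing from different SFT checkpoints. Averaging in weight space is viable here because all RMs are fine-tuned from a shared pre-trained backbone, which keeps them linearly mode connected.

```python
import torch

def warm_average(reward_models: list[torch.nn.Module]) -> dict[str, torch.Tensor]:
    """Uniformly average the weights of M fine-tuned reward models (WARM).

    All RMs share one architecture (they are fine-tuned from a common
    pre-trained backbone), so their state dicts align key by key and
    can be averaged parameter-wise.
    """
    state_dicts = [rm.state_dict() for rm in reward_models]
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# Hypothetical outer loop. Diversity across the M RMs comes from different
# seeds/hyperparameters and, under the Baklava strategy, from initializing
# each RM at a different checkpoint along the SFT trajectory:
#
# reward_models = [
#     train_reward_model(init=ckpt, seed=i)  # hypothetical helper
#     for i, ckpt in enumerate(sft_checkpoints)
# ]
# warm_rm.load_state_dict(warm_average(reward_models))
```
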
Open Source Code: No
LLM Response: The paper does not provide an explicit statement about releasing its source code or a link to a code repository.

Open Datasets: Yes
LLM Response: We use the TL;DR summarization benchmark (Völske et al., 2017)... In training, we use the dataset D_train from Stiennon et al. (2020) where the candidate summaries are generated by GPT-3 (Brown et al., 2020) variants.

Dataset Splits: Yes
LLM Response: The dataset contains 123k posts, and 5% is held out as the ID validation set. To generate the candidate responses in D_ood with 92k pairwise comparisons, we considered multiple PaLM-XS policies with high temperature; some are pre-trained only, others SFT-ed, and others RLHF-ed; the goal was to get a diverse set of summaries.

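As an illustration of the 5% ID validation hold-out quoted above, a minimal sketch of such a split; the seeded shuffle is an assumption, not the paper's documented procedure:

```python
import random

def held_out_split(posts: list, frac: float = 0.05, seed: int = 0):
    """Hold out `frac` of the posts as an ID validation set (e.g. 5% of 123k)."""
    rng = random.Random(seed)
    indices = list(range(len(posts)))
    rng.shuffle(indices)
    n_val = int(len(posts) * frac)
    return ([posts[i] for i in indices[n_val:]],  # training posts
            [posts[i] for i in indices[:n_val]])  # ID validation posts
```
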
Hardware Specification: No
LLM Response: The paper mentions using 'Cloud Vertex AI' for AI labeling and specifies the models used (e.g., PaLM-XXS, PaLM-XS, PaLM-L), but it does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for training or inference in their experiments.

Software Dependencies: No
LLM Response: The paper mentions using the 'Adafactor (Shazeer & Stern, 2018) optimizer' but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.

Experiment Setup: Yes
LLM Response: We train all RMs for 10k steps, a batch size of 128, the Adafactor (Shazeer & Stern, 2018) optimizer, a learning rate sampled in {1e-5, 4e-5, 1e-4}, and a dropout probability in {0.05, 0.1}. We then generate samples from the policy with temperature T = 0.9, batch size of 128, the Adafactor (Shazeer & Stern, 2018) optimizer, a learning rate of 1e-5, and a policy warmup of 2k steps. We set α = 0.003 for the KL regularization in the main experiment without label corruption, and α = 0.01 with label corruption.
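For reference, the quoted hyperparameters can be gathered into a single configuration sketch. The values are taken verbatim from the quote; the dictionary layout and key names are illustrative assumptions:

```python
# Values quoted above; key names and grouping are illustrative, not the paper's.
RM_TRAINING = {
    "steps": 10_000,
    "batch_size": 128,
    "optimizer": "Adafactor",             # Shazeer & Stern, 2018
    "learning_rate": (1e-5, 4e-5, 1e-4),  # sampled per RM
    "dropout": (0.05, 0.1),               # sampled per RM
}

RL_FINETUNING = {
    "sampling_temperature": 0.9,
    "batch_size": 128,
    "optimizer": "Adafactor",
    "learning_rate": 1e-5,
    "policy_warmup_steps": 2_000,
    "kl_alpha": 0.003,  # 0.01 in the label-corruption experiments
}
```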