WARM: On the Benefits of Weight Averaged Reward Models
Authors: Alexandre Rame, Nino Vieillard, Leonard Hussenot, Robert Dadashi-Tazehozi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM. (A minimal best-of-N sketch is included after the table.) |
| Researcher Affiliation | Industry | Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret (Google DeepMind). Correspondence to: Alexandre Ramé <alexandrerame@google.com>. |
| Pseudocode | No | The paper includes diagrams illustrating the WARM procedure and Baklava strategy, but it does not contain any structured pseudocode or algorithm blocks; a hedged sketch of the weight-averaging step is included below the table. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository. |
| Open Datasets | Yes | We use the TL;DR summarization benchmark (Völske et al., 2017)... In training, we use the dataset D_train from Stiennon et al. (2020) where the candidate summaries are generated by GPT-3 (Brown et al., 2020) variants. |
| Dataset Splits | Yes | The dataset contains 123k posts, and 5% is held out as the ID validation set. To generate the candidate responses in D_ood with 92k pairwise comparisons, we considered multiple PaLM-XS policies with high temperature, some are pre-trained only, others SFT-ed and others RLHF-ed; the goal was to get a diverse set of summaries. |
| Hardware Specification | No | The paper mentions using 'Cloud Vertex AI' for AI labeling and specifies the models used (e.g., PaLM-XXS, PaLM-XS, PaLM-L), but it does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for training or inference in their experiments. |
| Software Dependencies | No | The paper mentions using the 'Adafactor (Shazeer & Stern, 2018) optimizer' but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | We train all RMs for 10k steps, a batch size of 128, the Adafactor (Shazeer & Stern, 2018) optimizer, a learning rate sampled in {1e-5, 4e-5, 1e-4}, and a dropout probability in {0.05, 0.1}. We then generate samples from the policy with temperature T = 0.9, batch size of 128, the Adafactor (Shazeer & Stern, 2018) optimizer, a learning rate of 1e-5 and a policy warmup of 2k steps. We set α = 0.003 for the KL regularization in the main experiment without label corruption, and α = 0.01 with label corruption. (A configuration sketch collecting these values follows the table.) |
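
The Pseudocode row above notes that WARM is described with diagrams rather than pseudocode. As a reading aid, here is a minimal PyTorch sketch of the core step: uniformly averaging the parameters of M reward models fine-tuned from a shared pre-trained checkpoint (so they share an architecture and stay linearly connected). `RewardModel` and `average_state_dicts` are illustrative names, not code from the paper.

```python
# Hypothetical sketch of the WARM merge: average the weights of several reward
# models fine-tuned from the same initialization, then use the result as a
# single reward model. Class/function names are placeholders.
from typing import Dict, List
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Toy stand-in for a transformer-based reward model returning a scalar."""

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(torch.tanh(self.backbone(x))).squeeze(-1)


def average_state_dicts(state_dicts: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Uniformly average parameter tensors key by key."""
    return {
        key: torch.stack([sd[key] for sd in state_dicts], dim=0).mean(dim=0)
        for key in state_dicts[0]
    }


if __name__ == "__main__":
    # Pretend these M = 3 reward models come from diverse fine-tuning runs
    # (different seeds / hyperparameters), as in the paper's setup.
    rms = [RewardModel() for _ in range(3)]
    warm_rm = RewardModel()
    warm_rm.load_state_dict(average_state_dicts([rm.state_dict() for rm in rms]))

    features = torch.randn(4, 16)  # stand-in for encoded (prompt, response) pairs
    print("WARM rewards:", warm_rm(features))
```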
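
The Research Type row quotes the paper's use of best-of-N and RL against the reward model. Below is a generic best-of-N (rejection sampling) sketch, assuming placeholder `generate` and `reward_fn` callables rather than any API from the paper; with WARM, `reward_fn` would simply be the averaged reward model.

```python
# Generic best-of-N reranking: sample n candidates from the policy and keep the
# one the reward model scores highest. `generate` and `reward_fn` are assumed
# callables, not interfaces defined by the paper.
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_fn: Callable[[str, str], float],
              n: int = 8) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_fn(prompt, response))
```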
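
Finally, the hyperparameters quoted in the Experiment Setup row can be collected into a plain configuration mapping for quick reference; this is only a transcription of the reported values (the dictionary names are ours), not a runnable training script.

```python
# Reward-model training and RL fine-tuning settings as reported in the paper;
# tuples hold swept values. Dictionary names are illustrative.
RM_TRAINING_CONFIG = {
    "train_steps": 10_000,
    "batch_size": 128,
    "optimizer": "Adafactor",
    "learning_rate_sweep": (1e-5, 4e-5, 1e-4),
    "dropout_sweep": (0.05, 0.1),
}

RL_FINETUNING_CONFIG = {
    "sampling_temperature": 0.9,
    "batch_size": 128,
    "optimizer": "Adafactor",
    "learning_rate": 1e-5,
    "policy_warmup_steps": 2_000,
    "kl_alpha": 0.003,                  # main experiment, no label corruption
    "kl_alpha_label_corruption": 0.01,  # with label corruption
}
```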