WARM: On the Benefits of Weight Averaged Reward Models
Authors: Alexandre Rame, Nino Vieillard, Leonard Hussenot, Robert Dadashi-Tazehozi, Geoffrey Cideron, Olivier Bachem, Johan Ferret
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on summarization tasks, using best-of-N and RL methods, show that WARM improves the quality and alignment of LLM predictions; for example, a policy RL fine-tuned with WARM has a 79.4% win rate against a policy RL fine-tuned with a single RM. (A minimal best-of-N sketch is included after the table.) |
| Researcher Affiliation | Industry | Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, Johan Ferret (Google DeepMind). Correspondence to: Alexandre Ramé <alexandrerame@google.com>. |
| Pseudocode | No | The paper includes diagrams illustrating the WARM procedure and Baklava strategy, but it does not contain any structured pseudocode or algorithm blocks; a hedged sketch of the weight-averaging step is included below the table. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository. |
| Open Datasets | Yes | We use the TL;DR summarization benchmark (Völske et al., 2017)... In training, we use the dataset D_train from Stiennon et al. (2020) where the candidate summaries are generated by GPT-3 (Brown et al., 2020) variants. |
| Dataset Splits | Yes | The dataset contains 123k posts, and 5% is held out as the ID validation set. To generate the candidate responses in D_ood with 92k pairwise comparisons, we considered multiple PaLM-XS policies with high temperature, some are pre-trained only, others SFT-ed and others RLHF-ed; the goal was to get a diverse set of summaries. |
| Hardware Specification | No | The paper mentions using 'Cloud Vertex AI' for AI labeling and specifies the models used (e.g., PaLM-XXS, PaLM-XS, PaLM-L), but it does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for training or inference in their experiments. |
| Software Dependencies | No | The paper mentions using the 'Adafactor (Shazeer & Stern, 2018) optimizer' but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | We train all RMs for 10k steps, a batch size of 128, the Adafactor (Shazeer & Stern, 2018) optimizer, a learning rate sampled in {1e-5, 4e-5, 1e-4}, and a dropout probability in {0.05, 0.1}. We then generate samples from the policy with temperature T = 0.9, batch size of 128, the Adafactor (Shazeer & Stern, 2018) optimizer, a learning rate of 1e-5 and a policy warmup of 2k steps. We set α = 0.003 for the KL regularization in the main experiment without label corruption, and α = 0.01 with label corruption. (A configuration sketch collecting these values follows the table.) |
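
The Pseudocode row above notes that WARM is described with diagrams rather than pseudocode. As a reading aid, here is a minimal PyTorch sketch of the core step: uniformly averaging the parameters of M reward models fine-tuned from a shared pre-trained checkpoint (so they share an architecture and stay linearly connected). `RewardModel` and `average_state_dicts` are illustrative names, not code from the paper.

```python
# Hypothetical sketch of the WARM merge: average the weights of several reward
# models fine-tuned from the same initialization, then use the result as a
# single reward model. Class/function names are placeholders.
from typing import Dict, List
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Toy stand-in for a transformer-based reward model returning a scalar."""

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(torch.tanh(self.backbone(x))).squeeze(-1)


def average_state_dicts(state_dicts: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    """Uniformly average parameter tensors key by key."""
    return {
        key: torch.stack([sd[key] for sd in state_dicts], dim=0).mean(dim=0)
        for key in state_dicts[0]
    }


if __name__ == "__main__":
    # Pretend these M = 3 reward models come from diverse fine-tuning runs
    # (different seeds / hyperparameters), as in the paper's setup.
    rms = [RewardModel() for _ in range(3)]
    warm_rm = RewardModel()
    warm_rm.load_state_dict(average_state_dicts([rm.state_dict() for rm in rms]))

    features = torch.randn(4, 16)  # stand-in for encoded (prompt, response) pairs
    print("WARM rewards:", warm_rm(features))
```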
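
The Research Type row quotes the paper's use of best-of-N and RL against the reward model. Below is a generic best-of-N (rejection sampling) sketch, assuming placeholder `generate` and `reward_fn` callables rather than any API from the paper; with WARM, `reward_fn` would simply be the averaged reward model.

```python
# Generic best-of-N reranking: sample n candidates from the policy and keep the
# one the reward model scores highest. `generate` and `reward_fn` are assumed
# callables, not interfaces defined by the paper.
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward_fn: Callable[[str, str], float],
              n: int = 8) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_fn(prompt, response))
```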
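
Finally, the hyperparameters quoted in the Experiment Setup row can be collected into a plain configuration mapping for quick reference; this is only a transcription of the reported values (the dictionary names are ours), not a runnable training script.

```python
# Reward-model training and RL fine-tuning settings as reported in the paper;
# tuples hold swept values. Dictionary names are illustrative.
RM_TRAINING_CONFIG = {
    "train_steps": 10_000,
    "batch_size": 128,
    "optimizer": "Adafactor",
    "learning_rate_sweep": (1e-5, 4e-5, 1e-4),
    "dropout_sweep": (0.05, 0.1),
}

RL_FINETUNING_CONFIG = {
    "sampling_temperature": 0.9,
    "batch_size": 128,
    "optimizer": "Adafactor",
    "learning_rate": 1e-5,
    "policy_warmup_steps": 2_000,
    "kl_alpha": 0.003,                  # main experiment, no label corruption
    "kl_alpha_label_corruption": 0.01,  # with label corruption
}
```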