Reward Model Ensembles Help Mitigate Overoptimization

Authors: Thomas Coste, Usman Anwar, Robert Kirk, David Krueger

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) (b) proximal policy optimization (PPO).
Researcher Affiliation | Academia | Thomas Coste¹, Usman Anwar¹, Robert Kirk², David Krueger¹ (¹University of Cambridge, ²University College London)
Pseudocode | No | The paper describes methods and formulas but does not include structured pseudocode or algorithm blocks; a hedged illustrative sketch of the WCO and UWO objectives is given after this table.
Open Source Code | Yes | The code is available at: https://github.com/tlc4418/llm_optimization.
Open Datasets | Yes | In order to train proxy reward models, we use the Alpaca dataset (Taori et al., 2023), with 52,000 instructions covering a range of commands and corresponding demonstrations generated by OpenAI's text-davinci-003 (OpenAI, 2023a).
Dataset Splits | Yes | More specifically, we use the AlpacaFarm (Dubois et al., 2023) variant of the dataset, as it provides splits for use in the different RLHF stages and for validation, as well as human preference annotations. Further details concerning the splits, prompt format, and examples are given in Appendix D.
Hardware Specification | Yes | Generating the n = 12,500 answers for 1000 prompts and then relabeling them with proxy and gold reward model takes approximately 700 A100 GPU hours.
Software Dependencies | No | The paper mentions specific models and data generation tools but does not list general software dependencies (e.g., libraries, frameworks) with specific version numbers.
Experiment Setup | Yes | Table 1: SFT hyperparameters.
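
Since the Pseudocode row notes that the paper provides formulas but no algorithm blocks, the sketch below is a minimal, hedged illustration of the two conservative ensemble objectives named above: WCO takes the minimum reward across ensemble members, and UWO takes the ensemble mean minus a coefficient times the intra-ensemble variance; the same conservative score can then drive best-of-n selection. All names here (reward_fns, lambda_coef, best_of_n) are illustrative assumptions and are not taken from the authors' repository.

```python
# Illustrative sketch only (not the authors' implementation): conservative
# combination of an ensemble of proxy reward models, as described in the paper,
# plus best-of-n selection using the combined score.
import statistics
from typing import Callable, List, Sequence

# A reward model is treated abstractly as a function (prompt, response) -> scalar.
RewardFn = Callable[[str, str], float]


def ensemble_scores(reward_fns: Sequence[RewardFn], prompt: str, response: str) -> List[float]:
    """Score one (prompt, response) pair with every ensemble member."""
    return [rm(prompt, response) for rm in reward_fns]


def wco(scores: Sequence[float]) -> float:
    """Worst-case optimization: the minimum reward across ensemble members."""
    return min(scores)


def uwo(scores: Sequence[float], lambda_coef: float = 0.5) -> float:
    """Uncertainty-weighted optimization: ensemble mean penalized by the
    intra-ensemble variance, scaled by lambda_coef (illustrative value)."""
    return statistics.mean(scores) - lambda_coef * statistics.pvariance(scores)


def best_of_n(prompt: str, candidates: Sequence[str],
              reward_fns: Sequence[RewardFn], objective: str = "uwo",
              lambda_coef: float = 0.5) -> str:
    """Best-of-n sampling: return the candidate with the highest conservative reward."""
    def conservative_reward(response: str) -> float:
        scores = ensemble_scores(reward_fns, prompt, response)
        return wco(scores) if objective == "wco" else uwo(scores, lambda_coef)

    return max(candidates, key=conservative_reward)
```

With PPO, the same conservative scalar would stand in for the single proxy reward during policy optimization; the sketch above covers only the BoN case for brevity.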