Reward Model Ensembles Help Mitigate Overoptimization
Authors: Thomas Coste, Usman Anwar, Robert Kirk, David Krueger
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) and (b) proximal policy optimization (PPO). |
| Researcher Affiliation | Academia | Thomas Coste1, Usman Anwar1, Robert Kirk2, David Krueger1 1University of Cambridge, 2University College London |
| Pseudocode | No | The paper describes methods and formulas but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at: https://github.com/tlc4418/llm_optimization. |
| Open Datasets | Yes | In order to train proxy reward models, we use the Alpaca dataset (Taori et al., 2023), with 52,000 instructions covering a range of commands and corresponding demonstrations generated by OpenAI's text-davinci-003 (OpenAI, 2023a). |
| Dataset Splits | Yes | More specifically, we use the AlpacaFarm (Dubois et al., 2023) variant of the dataset, as it provides splits for use in the different RLHF stages and for validation, as well as human preference annotations. Further details concerning the splits, prompt format, and examples are given in Appendix D. |
| Hardware Specification | Yes | Generating the n = 12,500 answers for 1000 prompts and then relabeling them with proxy and gold reward models takes approximately 700 A100 GPU hours. |
| Software Dependencies | No | The paper mentions specific models and data generation tools but does not list general software dependencies (e.g., libraries, frameworks) with specific version numbers. |
| Experiment Setup | Yes | Table 1: SFT hyperparameters. |
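The two conservative objectives named in the Research Type row can be sketched directly from their descriptions: WCO scores each response by the worst (minimum) reward across ensemble members, while UWO penalizes the ensemble mean by the intra-ensemble variance. The sketch below is a minimal illustration under those definitions, not the paper's implementation; the function names and the uncertainty coefficient `lam` are hypothetical (the paper tunes this weight).

```python
import numpy as np

def ensemble_objectives(rewards: np.ndarray, lam: float = 0.1):
    """Conservative ensemble objectives, sketched from the paper's
    descriptions of WCO and UWO.

    rewards: shape (k, n) -- scores from k ensemble reward models
    for n candidate responses. `lam` is an assumed uncertainty weight.
    """
    # Worst-case optimization (WCO): minimum reward across
    # ensemble members, per response.
    wco = rewards.min(axis=0)
    # Uncertainty-weighted optimization (UWO): ensemble mean minus
    # a variance penalty, per response.
    uwo = rewards.mean(axis=0) - lam * rewards.var(axis=0)
    return wco, uwo

def best_of_n(rewards: np.ndarray) -> int:
    """Best-of-n sampling: return the index of the candidate with
    the highest (conservative) proxy reward."""
    return int(np.argmax(rewards))
```

In a BoN pipeline, one would score n sampled responses with each ensemble member, combine the scores via WCO or UWO, and select the argmax; in PPO, the combined score would replace the single proxy reward in the training objective.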