Reward Model Ensembles Help Mitigate Overoptimization

Authors: Thomas Coste, Usman Anwar, Robert Kirk, David Krueger

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) (b) proximal policy optimization (PPO).
Researcher Affiliation | Academia | Thomas Coste¹, Usman Anwar¹, Robert Kirk², David Krueger¹ (¹University of Cambridge, ²University College London)
Pseudocode | No | The paper describes methods and formulas but does not include structured pseudocode or algorithm blocks; a hedged illustrative sketch of the WCO and UWO objectives is given after this table.
Open Source Code | Yes | The code is available at: https://github.com/tlc4418/llm_optimization.
Open Datasets | Yes | In order to train proxy reward models, we use the Alpaca dataset (Taori et al., 2023), with 52,000 instructions covering a range of commands and corresponding demonstrations generated by OpenAI's text-davinci-003 (OpenAI, 2023a).
Dataset Splits | Yes | More specifically, we use the AlpacaFarm (Dubois et al., 2023) variant of the dataset, as it provides splits for use in the different RLHF stages and for validation, as well as human preference annotations. Further details concerning the splits, prompt format, and examples are given in Appendix D.
Hardware Specification | Yes | Generating the n = 12,500 answers for 1000 prompts and then relabeling them with proxy and gold reward model takes approximately 700 A100 GPU hours.
Software Dependencies | No | The paper mentions specific models and data generation tools but does not list general software dependencies (e.g., libraries, frameworks) with specific version numbers.
Experiment Setup | Yes | Table 1: SFT hyperparameters.
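
Since the Pseudocode row notes that the paper provides formulas but no algorithm blocks, the sketch below is a minimal, hedged illustration of the two conservative ensemble objectives named above: WCO takes the minimum reward across ensemble members, and UWO takes the ensemble mean minus a coefficient times the intra-ensemble variance; the same conservative score can then drive best-of-n selection. All names here (reward_fns, lambda_coef, best_of_n) are illustrative assumptions and are not taken from the authors' repository.

```python
# Illustrative sketch only (not the authors' implementation): conservative
# combination of an ensemble of proxy reward models, as described in the paper,
# plus best-of-n selection using the combined score.
import statistics
from typing import Callable, List, Sequence

# A reward model is treated abstractly as a function (prompt, response) -> scalar.
RewardFn = Callable[[str, str], float]


def ensemble_scores(reward_fns: Sequence[RewardFn], prompt: str, response: str) -> List[float]:
    """Score one (prompt, response) pair with every ensemble member."""
    return [rm(prompt, response) for rm in reward_fns]


def wco(scores: Sequence[float]) -> float:
    """Worst-case optimization: the minimum reward across ensemble members."""
    return min(scores)


def uwo(scores: Sequence[float], lambda_coef: float = 0.5) -> float:
    """Uncertainty-weighted optimization: ensemble mean penalized by the
    intra-ensemble variance, scaled by lambda_coef (illustrative value)."""
    return statistics.mean(scores) - lambda_coef * statistics.pvariance(scores)


def best_of_n(prompt: str, candidates: Sequence[str],
              reward_fns: Sequence[RewardFn], objective: str = "uwo",
              lambda_coef: float = 0.5) -> str:
    """Best-of-n sampling: return the candidate with the highest conservative reward."""
    def conservative_reward(response: str) -> float:
        scores = ensemble_scores(reward_fns, prompt, response)
        return wco(scores) if objective == "wco" else uwo(scores, lambda_coef)

    return max(candidates, key=conservative_reward)
```

With PPO, the same conservative scalar would stand in for the single proxy reward during policy optimization; the sketch above covers only the BoN case for brevity.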