Confronting Reward Model Overoptimization with Constrained RLHF

Authors: Ted Moskovitz, Aaditya K Singh, DJ Strouse, Tuomas Sandholm, Ruslan Salakhutdinov, Anca Dragan, Stephen Marcus McAleer

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM's threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.
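To make the weighting mechanism concrete, here is a minimal sketch of combining component RM scores with sigmoid-squashed Lagrange multipliers and updating the multipliers by dual ascent. This is not the authors' implementation: the thresholds, step size, and variable names are illustrative, and the paper's constrained-PPO variants (µ-PPO, All-PPO, ξ-PPO) differ in which rewards are constrained.

```python
import torch

thresholds = torch.tensor([0.8, 0.6])   # per-RM usefulness thresholds (assumed values)
lambda_logits = torch.zeros(2)          # one raw multiplier parameter per component RM
eta = 1e-2                              # dual step size (assumed)

def combined_reward(rm_scores: torch.Tensor) -> torch.Tensor:
    """Weight component RM scores with sigmoid-squashed multipliers
    (Table 2 lists sigmoid/tanh as the multiplier functions)."""
    return (torch.sigmoid(lambda_logits) * rm_scores).sum(-1)

def update_multipliers(mean_rm_scores: torch.Tensor) -> None:
    """Dual ascent on the multipliers: a weight grows while its RM is below
    its threshold and shrinks once the threshold is met, so no RM is pushed
    past the range where it remains a useful proxy."""
    global lambda_logits
    lambda_logits = lambda_logits + eta * (thresholds - mean_rm_scores)
```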
Researcher Affiliation | Collaboration | Ted Moskovitz (Gatsby Unit, UCL); Aaditya K. Singh (Gatsby Unit, UCL); DJ Strouse (Google DeepMind); Tuomas Sandholm (Carnegie Mellon University); Ruslan Salakhutdinov (Carnegie Mellon University); Anca D. Dragan (University of California, Berkeley); Stephen McAleer (Carnegie Mellon University)
Pseudocode | Yes | Detailed pseudocode is provided in Algorithm 1.
Open Source Code | Yes | Code for all methods is available here: github.com/tedmoskovitz/ConstrainedRL4LMs.
Open Datasets | Yes | We focus on a single setting as a case study: dialogue generation with the DailyDialog (Li et al., 2017) dataset, which consists of transcripts of conversations between humans.
Dataset Splits | Yes | The context window was of length 5, and separating the conversations in this way resulted in 35k training, 3k validation, and 3k test utterances.
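As a rough illustration of that split, the sketch below slices each conversation into 5-utterance contexts with a next-utterance target. It assumes the Hugging Face `datasets` copy of DailyDialog; field names and the exact filtering may differ from the authors' pipeline.

```python
from datasets import load_dataset

CONTEXT_LEN = 5  # context window of length 5, as described above

def make_examples(split: str):
    data = load_dataset("daily_dialog", split=split)  # Li et al. (2017)
    examples = []
    for conv in data["dialog"]:
        # Each position with a full 5-utterance history yields one example.
        for i in range(CONTEXT_LEN, len(conv)):
            examples.append({"context": conv[i - CONTEXT_LEN:i], "target": conv[i]})
    return examples

train, val, test = (make_examples(s) for s in ("train", "validation", "test"))
print(len(train))  # should land near the ~35k training utterances reported
```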
Hardware Specification | Yes | All experiments were performed on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions specific packages such as the SciPy optimize package (Virtanen et al., 2020) and references L-BFGS-B (Zhu et al., 1997), but it does not provide version numbers for the broader software stack or general dependencies such as Python or PyTorch, which are crucial for reproducibility.
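For reference, a SciPy L-BFGS-B call of the kind the paper cites looks like the sketch below. The objective here is a stand-in: the paper's actual objective (evaluation performance as a function of the constraint thresholds) is not reproduced in this excerpt, and the initial guess and bounds are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def negative_eval_reward(thresholds: np.ndarray) -> float:
    # Placeholder objective: stands in for (minus) the evaluation reward
    # obtained when the RMs are constrained at `thresholds`.
    return float(np.sum((thresholds - np.array([0.8, 0.6])) ** 2))

result = minimize(
    negative_eval_reward,
    x0=np.full(2, 0.5),               # initial threshold guess (assumed)
    method="L-BFGS-B",                # Zhu et al. (1997)
    bounds=[(0.0, 1.0)] * 2,          # assumed valid threshold range
)
print(result.x)                       # thresholds minimizing the placeholder
```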
Experiment Setup | Yes | Table 2: Experiment Hyperparameters. Columns: Hyperparameter, PPO, PPO-SAT, µ-PPO, All-PPO, ξ-PPO. Visible rows: Steps per Update (M): 1,280 ...; GAE λ: 0.95 ...; Lagrange Multiplier Function: sigmoid / sigmoid / tanh.
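Read as a config, the visible slice of Table 2 amounts to something like the sketch below; only values present in the excerpt are filled in, and attributing the sigmoid entry to the base PPO setup (with tanh used by at least one variant) is an assumption.

```python
# Hedged reconstruction of the visible Table 2 entries; all other
# hyperparameters are elided in the excerpt and left out here.
base_config = {
    "steps_per_update": 1_280,            # "Steps per Update (M)"
    "gae_lambda": 0.95,                   # GAE λ
    "lagrange_multiplier_fn": "sigmoid",  # "tanh" for at least one variant
}
```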