A Minimaximalist Approach to Reinforcement Learning from Human Feedback
Authors: Gokul Swamy, Christoph Dann, Rahul Kidambi, Steven Wu, Alekh Agarwal
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Google Research. |
| Pseudocode | Yes | Algorithm 1: SPO (Theoretical Version); Algorithm 2: SPO (Practical Version). (A hedged sketch of the practical recipe appears after this table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code specific to the methodology described. |
| Open Datasets | Yes | We evaluate SPO experimentally and compare it against an iterative Reward Modeling (RM) approach along several axes. We focus on the context-free, online oracle setting and leave exploration of the contextual, offline dataset setting to future work. Specifically, we ask the following questions: 1. Can SPO compute MWs [Minimax Winners] when faced with intransitive preferences? We consider aggregating three populations in different proportions, each of which has transitive preferences internally. We measure how far off SPO is from the exact MW. We also present qualitative results on a continuous control task from MuJoCo (Brockman et al., 2016), where computing the MW for comparison is infeasible. (See the MW computation sketch after this table.) |
| Dataset Splits | No | The paper mentions 'validation data' in Algorithm 2 but does not provide specific details on the dataset splits (percentages, counts, or methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run the experiments. |
| Software Dependencies | No | We use Soft Actor Critic (SAC, Haarnoja et al. (2018)) for continuous control and Proximal Policy Optimization (PPO, Schulman et al. (2017)) for discrete action tasks, both as implemented in the ACME framework (Hoffman et al., 2020). The paper mentions software by name but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use the SAC implementation in Hoffman et al. (2020) for all of our continuous control experiments. We use the same SAC hyperparameters for all methods, other than the fact that we use 3e-4 rather than 3e-5 as the learning rate for vanilla SAC. We use Adam for all optimization. We use three-layer networks of width 256 for all function approximation. We use ReLU activations for the actor and critic. Table 2: Hyperparameters for SAC. (See the config sketch after this table.) |
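
Since no official code is released (see the "Open Source Code" row), the following is a minimal sketch of the self-play, win-rate reward at the heart of Algorithm 2 (the practical version of SPO). It assumes scalar trajectory scores and a synthetic Bradley-Terry comparator standing in for the human preference oracle; the names `spo_win_rate_rewards` and `noisy_preference` are illustrative, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_preference(traj_a, traj_b):
    """Toy stand-in for the preference oracle: a Bradley-Terry comparison on
    scalar trajectory scores. The real oracle aggregates (possibly
    intransitive, stochastic) judgments over full trajectories."""
    p = 1.0 / (1.0 + np.exp(-(traj_a - traj_b)))
    return float(rng.random() < p)

def spo_win_rate_rewards(trajectories, oracle):
    """Self-play scoring: each trajectory's reward is its win rate against the
    other trajectories sampled from the *same* policy, so no separate reward
    model is ever fit."""
    n = len(trajectories)
    rewards = np.zeros(n)
    for i in range(n):
        wins = sum(oracle(trajectories[i], trajectories[j])
                   for j in range(n) if j != i)
        rewards[i] = wins / (n - 1)
    return rewards

# Example: score a batch of four sampled trajectories (here, just scalar returns).
batch = [1.2, 0.4, 2.0, 0.9]
print(spo_win_rate_rewards(batch, noisy_preference))
```

These win-rate rewards would then be handed to an off-the-shelf RL learner (the paper uses SAC for continuous control and PPO for discrete-action tasks), which is how SPO sidesteps reward-model training.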
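
For the Minimax Winner question in the "Open Datasets" row, here is a self-contained sketch, not taken from the paper, of computing an exact MW for a small option set by solving the induced two-player zero-sum game as a linear program. SciPy is an assumed dependency, and `minimax_winner` is an illustrative name.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_winner(pref):
    """Compute the Minimax Winner: the symmetric Nash equilibrium of the
    zero-sum game with skew-symmetric payoff M[i, j] = P(i beats j) - P(j beats i),
    where pref[i, j] is the probability that option i is preferred to option j."""
    pref = np.asarray(pref, dtype=float)
    n = pref.shape[0]
    M = pref - pref.T                                   # skew-symmetric margins
    # Variables x = (p_1, ..., p_n, v); maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(n), [-1.0]])
    # Guarantee value v against every pure response j:
    #   sum_i p_i * M[i, j] >= v   <=>   -M[:, j] @ p + v <= 0
    A_ub = np.hstack([-M.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)  # p sums to 1
    b_eq = [1.0]
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]

# Rock-paper-scissors style intransitive preferences: no option beats all
# others, and the MW is the uniform mixture (1/3, 1/3, 1/3).
P = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
print(minimax_winner(P))
```

This exact solve is only feasible for small, enumerable option sets, which is why the quoted passage notes that computing the MW for comparison is infeasible on the MuJoCo task.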
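
Finally, a hedged restatement of the reported SAC setup from the "Experiment Setup" row as a config dictionary; the key names follow common conventions and are not the authors' exact configuration schema.

```python
# Reported SAC hyperparameters (cf. Table 2 of the paper), restated as a dict.
sac_config = {
    "optimizer": "adam",
    "learning_rate": 3e-5,                  # 3e-4 is used for vanilla SAC instead
    "hidden_layer_sizes": (256, 256, 256),  # three-layer MLPs of width 256
    "activation": "relu",                   # ReLU for both actor and critic
}
```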