A Minimaximalist Approach to Reinforcement Learning from Human Feedback
Authors: Gokul Swamy, Christoph Dann, Rahul Kidambi, Steven Wu, Alekh Agarwal
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Google Research. |
| Pseudocode | Yes | Algorithm 1: SPO (Theoretical Version); Algorithm 2: SPO (Practical Version). (A hedged sketch of the practical recipe appears after this table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code specific to the methodology described. |
| Open Datasets | Yes | We evaluate SPO experimentally and compare it against an iterative Reward Modeling (RM) approach along several axes. We focus on the context-free, online oracle setting and leave exploration of the contextual, offline dataset setting to future work. Specifically, we ask the following questions: 1. Can SPO compute MWs [Minimax Winners] when faced with intransitive preferences? We consider aggregating three populations in different proportions, each of which has transitive preferences internally. We measure how far off SPO is from the exact MW. We also present qualitative results on a continuous control task from MuJoCo (Brockman et al., 2016), where computing the MW for comparison is infeasible. (See the MW computation sketch after this table.) |
| Dataset Splits | No | The paper mentions 'validation data' in Algorithm 2 but does not provide specific details on the dataset splits (percentages, counts, or methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU models, CPU models, memory) used to run the experiments. |
| Software Dependencies | No | We use Soft Actor Critic (SAC, Haarnoja et al. (2018)) for continuous control and Proximal Policy Optimization (PPO, Schulman et al. (2017)) for discrete action tasks, both as implemented in the ACME framework (Hoffman et al., 2020). The paper mentions software by name but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use the SAC implementation in Hoffman et al. (2020) for all of our continuous control experiments. We use the same SAC hyperparameters for all methods, other than the fact that we use 3e-4 rather than 3e-5 as the learning rate for vanilla SAC. We use Adam for all optimization. We use three-layer networks of width 256 for all function approximation. We use ReLU activations for the actor and critic. Table 2: Hyperparameters for SAC. (See the config sketch after this table.) |
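
Since no official code is released (see the "Open Source Code" row), the following is a minimal sketch of the self-play, win-rate reward at the heart of Algorithm 2 (the practical version of SPO). It assumes scalar trajectory scores and a synthetic Bradley-Terry comparator standing in for the human preference oracle; the names `spo_win_rate_rewards` and `noisy_preference` are illustrative, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_preference(traj_a, traj_b):
    """Toy stand-in for the preference oracle: a Bradley-Terry comparison on
    scalar trajectory scores. The real oracle aggregates (possibly
    intransitive, stochastic) judgments over full trajectories."""
    p = 1.0 / (1.0 + np.exp(-(traj_a - traj_b)))
    return float(rng.random() < p)

def spo_win_rate_rewards(trajectories, oracle):
    """Self-play scoring: each trajectory's reward is its win rate against the
    other trajectories sampled from the *same* policy, so no separate reward
    model is ever fit."""
    n = len(trajectories)
    rewards = np.zeros(n)
    for i in range(n):
        wins = sum(oracle(trajectories[i], trajectories[j])
                   for j in range(n) if j != i)
        rewards[i] = wins / (n - 1)
    return rewards

# Example: score a batch of four sampled trajectories (here, just scalar returns).
batch = [1.2, 0.4, 2.0, 0.9]
print(spo_win_rate_rewards(batch, noisy_preference))
```

These win-rate rewards would then be handed to an off-the-shelf RL learner (the paper uses SAC for continuous control and PPO for discrete-action tasks), which is how SPO sidesteps reward-model training.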
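
For the Minimax Winner question in the "Open Datasets" row, here is a self-contained sketch, not taken from the paper, of computing an exact MW for a small option set by solving the induced two-player zero-sum game as a linear program. SciPy is an assumed dependency, and `minimax_winner` is an illustrative name.

```python
import numpy as np
from scipy.optimize import linprog

def minimax_winner(pref):
    """Compute the Minimax Winner: the symmetric Nash equilibrium of the
    zero-sum game with skew-symmetric payoff M[i, j] = P(i beats j) - P(j beats i),
    where pref[i, j] is the probability that option i is preferred to option j."""
    pref = np.asarray(pref, dtype=float)
    n = pref.shape[0]
    M = pref - pref.T                                   # skew-symmetric margins
    # Variables x = (p_1, ..., p_n, v); maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(n), [-1.0]])
    # Guarantee value v against every pure response j:
    #   sum_i p_i * M[i, j] >= v   <=>   -M[:, j] @ p + v <= 0
    A_ub = np.hstack([-M.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)  # p sums to 1
    b_eq = [1.0]
    bounds = [(0, 1)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]

# Rock-paper-scissors style intransitive preferences: no option beats all
# others, and the MW is the uniform mixture (1/3, 1/3, 1/3).
P = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
print(minimax_winner(P))
```

This exact solve is only feasible for small, enumerable option sets, which is why the quoted passage notes that computing the MW for comparison is infeasible on the MuJoCo task.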
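
Finally, a hedged restatement of the reported SAC setup from the "Experiment Setup" row as a config dictionary; the key names follow common conventions and are not the authors' exact configuration schema.

```python
# Reported SAC hyperparameters (cf. Table 2 of the paper), restated as a dict.
sac_config = {
    "optimizer": "adam",
    "learning_rate": 3e-5,                  # 3e-4 is used for vanilla SAC instead
    "hidden_layer_sizes": (256, 256, 256),  # three-layer MLPs of width 256
    "activation": "relu",                   # ReLU for both actor and critic
}
```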