Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
Authors: Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To empirically validate our proposed technique, we first show that it can provide a way to combat underspecification in simulated control problems, inferring and optimizing user-specific reward functions. Next, we conduct experiments on pluralistic language datasets representing diverse user preferences and demonstrate improved reward function accuracy. |
| Researcher Affiliation | Academia | Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques, Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, <sriyash, ymwan, hamishiv, abhgupta, nj>@cs.washington.edu |
| Pseudocode | Yes | Algorithm 1: Learning Multimodal Reward Functions using VPL; Algorithm 2: Policy Optimization using IQL and VPL; Algorithm 3: Policy Optimization using IQL and SPO + VPL (a hedged sketch of the Algorithm 1 objective follows the table) |
| Open Source Code | No | We do not provide immediate access to the data and code, but will do so in the future. |
| Open Datasets | Yes | Maze-Navigation: this task is adapted from the 'maze2d-medium-v2' environment from the D4RL benchmark [27]. Ravens-Manipulation: this task is adapted from the Ravens benchmark [70]. Habitat-Rearrange: this is a task based on the Meta Habitat simulator [68]. UltraFeedback [20] dataset. |
| Dataset Splits | No | The paper describes data sampling and processing for context sets (e.g., 'randomly sample a smaller subset of K data points' and 'get 8 samples from the context set'), but does not explicitly define training, validation, and test dataset splits with percentages or counts for its experiments. |
| Hardware Specification | Yes | Computational Resources: 2× RTX 4090, 4× A100 |
| Software Dependencies | No | The paper mentions 'Optimizer Adam (Kingma & Ba, 2015)' but does not provide specific version numbers for software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | Table 2: Hyperparameters for learning reward models using VPL. We sweep over these values and report the best results on 5 seeds. Table 3: Hyperparameters for IQL. We use the same parameters across all experiments. Table 4: Hyperparameters for LLM experiments |
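The pseudocode row above refers to Algorithm 1, which learns a multimodal reward function by inferring a latent user embedding from a context of annotated preference pairs and decoding user-conditioned rewards. The sketch below is a minimal PyTorch rendering of that idea as described in the paper (an encoder producing q(z | context), a reward decoder r(s, z), and an ELBO combining a Bradley–Terry preference loss with a KL term); the network sizes, the names `VPLRewardModel`, `obs_dim`, `latent_dim`, and the KL weight `beta` are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the VPL reward-learning objective (assumed shapes and names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VPLRewardModel(nn.Module):
    def __init__(self, obs_dim: int, latent_dim: int = 32, hidden: int = 256):
        super().__init__()
        # Encoder: maps each annotated context pair (s_a, s_b, label) to
        # statistics of q(z | context); aggregated over the context below.
        self.encoder = nn.Sequential(
            nn.Linear(2 * obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),  # mean and log-variance
        )
        # Decoder: user-conditioned reward r(s, z).
        self.reward = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def encode(self, ctx_a, ctx_b, ctx_label):
        # ctx_a, ctx_b: (batch, K, obs_dim); ctx_label: (batch, K, 1), 1 if a is preferred.
        pairs = torch.cat([ctx_a, ctx_b, ctx_label], dim=-1)
        stats = self.encoder(pairs).mean(dim=1)  # pool over the K context pairs
        mu, log_var = stats.chunk(2, dim=-1)
        return mu, log_var

    def elbo_loss(self, ctx_a, ctx_b, ctx_label,
                  query_a, query_b, query_label, beta: float = 1e-4):
        mu, log_var = self.encode(ctx_a, ctx_b, ctx_label)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization
        r_a = self.reward(torch.cat([query_a, z], dim=-1))
        r_b = self.reward(torch.cat([query_b, z], dim=-1))
        # Bradley-Terry preference likelihood on a held-out query pair.
        pref_loss = F.binary_cross_entropy_with_logits(r_a - r_b, query_label)
        # KL(q(z | context) || N(0, I)).
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1).mean()
        return pref_loss + beta * kl
```

The mean-pooling over context pairs is one permutation-invariant aggregation choice assumed here for brevity; the paper's actual encoder architecture and hyperparameters are those reported in its Tables 2–4.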