reproducibilityindex.ai

Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning

Authors: Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques

NeurIPS 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To empirically validate our proposed technique, we first show that it can provide a way to combat underspecification in simulated control problems, inferring and optimizing user-specific reward functions. Next, we conduct experiments on pluralistic language datasets representing diverse user preferences and demonstrate improved reward function accuracy.
Researcher Affiliation	Academia	Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques; Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195; <sriyash, ymwan, hamishiv, abhgupta, nj>@cs.washington.edu
Pseudocode	Yes	Algorithm 1: Learning Multimodal Reward Functions using VPL; Algorithm 2: Policy Optimization using IQL and VPL; Algorithm 3: Policy Optimization using IQL and SPO + VPL
Open Source Code	No	We do not provide immediate access to the data and code, but will do so in the future.
Open Datasets	Yes	Maze-Navigation. This task is adapted from the 'maze2d-medium-v2' environment from the D4RL benchmark [27]. Ravens-Manipulation. This task is adapted from the ravens benchmark [70]. Habitat-Rearrange. This is a task based on the Meta Habitat simulator [68]. UltraFeedback [20] dataset.
Dataset Splits	No	The paper describes data sampling and processing for context sets (e.g., 'randomly sample a smaller subset of K data points' and 'get 8 samples from the context set'), but does not explicitly define training, validation, and test dataset splits with percentages or counts for its experiments.
Hardware Specification	Yes	Computational Resources: 2 RTX 4090, 4 A100
Software Dependencies	No	The paper mentions 'Optimizer Adam (Kingma & Ba, 2015)' but does not provide specific version numbers for software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup	Yes	Table 2: Hyperparameters for learning reward models using VPL. We sweep over these values and report the best results on 5 seeds. Table 3: Hyperparameters for IQL. We use the same parameters across all experiments. Table 4: Hyperparameters for LLM experiments.