Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
Authors: Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To empirically validate our proposed technique, we first show that it can provide a way to combat underspecification in simulated control problems, inferring and optimizing user-specific reward functions. Next, we conduct experiments on pluralistic language datasets representing diverse user preferences and demonstrate improved reward function accuracy. |
| Researcher Affiliation | Academia | Sriyash Poddar, Yanming Wan, Hamish Ivison, Abhishek Gupta, Natasha Jaques. Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195. <sriyash, ymwan, hamishiv, abhgupta, nj>@cs.washington.edu |
| Pseudocode | Yes | Algorithm 1 Learning Multimodal Reward Functions using VPL, Algorithm 2 Policy Optimization using IQL and VPL, Algorithm 3 Policy Optimization using IQL and SPO + VPL |
| Open Source Code | No | We do not provide immediate access to the data and code, but will do so in the future. |
| Open Datasets | Yes | Maze-Navigation. This task is adapted from the 'maze2d-medium-v2' environment from the D4RL benchmark [27]. Ravens-Manipulation. This task is adapted from the ravens benchmark [70]. Habitat-Rearrange. This is a task based on the Meta Habitat simulator [68]. UltraFeedback [20] dataset. |
| Dataset Splits | No | The paper describes data sampling and processing for context sets (e.g., 'randomly sample a smaller subset of K data points' and 'get 8 samples from the context set'), but does not explicitly define training, validation, and test dataset splits with percentages or counts for its experiments. |
| Hardware Specification | Yes | Computational Resources: 2 RTX4090, 4 A100 |
| Software Dependencies | No | The paper mentions 'Optimizer Adam (Kingma & Ba, 2015)' but does not provide specific version numbers for software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | Table 2: Hyperparameters for learning reward models using VPL. We sweep over these values and report the best results on 5 seeds. Table 3: Hyperparameters for IQL. We use the same parameters across all experiments. Table 4: Hyperparameters for LLM experiments. |