Provable Reward-Agnostic Preference-Based Reinforcement Learning

Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this study, we fill in such a gap between theoretical PbRL and practical algorithms by proposing a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired before collecting any human feedback. Theoretical analysis demonstrates that our algorithm requires less human feedback for learning the optimal policy under preference-based models with linear parameterization and unknown transitions, compared to the existing theoretical literature. (A minimal sketch of this preference model appears after the table.)
Researcher Affiliation | Collaboration | Wenhao Zhan (Princeton University, wenhao.zhan@princeton.edu); Masatoshi Uehara (Genentech, uehara.masatoshi@gene.com); Wen Sun (Cornell University, ws455@cornell.edu); Jason D. Lee (Princeton University, jasonlee@princeton.edu)
Pseudocode | Yes | Algorithm 1 REGIME: Experimental Design for Querying Human Preference; Algorithm 2 REGIME-lin; Algorithm 3 REGIME-action; Algorithm 4 REGIME-exploration; Algorithm 5 REGIME-planning. (A toy skeleton of this pipeline appears after the table.)
Open Source Code | No | The paper does not provide an explicit statement or link indicating the release of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical, focusing on algorithmic design and theoretical analysis of sample complexity. It conducts no empirical experiments on specific datasets and therefore provides no access information for a publicly available or open training dataset.
Dataset Splits | No | The paper is theoretical, focusing on algorithmic design and theoretical analysis rather than empirical experiments, so it provides no dataset split information for validation.
Hardware Specification | No | The paper is theoretical and conducts no empirical experiments, so no hardware details are mentioned.
Software Dependencies | No | The paper is theoretical and conducts no empirical experiments, so no ancillary software dependencies with version numbers are listed.
Experiment Setup | No | The paper is theoretical and conducts no empirical experiments, so no experimental setup details such as hyperparameters or training configurations are provided.
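For context on the setting described in the Research Type row: the paper studies preference-based RL where the hidden reward is linear in known trajectory features and human feedback on a trajectory pair follows a Bradley-Terry-style comparison model. Below is a minimal sketch, assuming that setup, of how such a reward parameter could be estimated from preference labels by logistic maximum likelihood. The function names, the gradient-ascent estimator, and the toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Hedged sketch: linear reward r(tau) = <theta, phi(tau)> with a
# Bradley-Terry preference model P(tau0 preferred) = sigmoid(<theta, phi0 - phi1>).
# Plain gradient ascent on the preference log-likelihood is an
# illustrative stand-in for the paper's MLE step.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_reward_mle(phi_diffs, labels, lr=0.1, iters=2000):
    """Logistic MLE for the hidden reward parameter theta.

    phi_diffs: (n, d) array, phi(tau0_i) - phi(tau1_i) for each queried pair.
    labels:    (n,) array in {0, 1}; 1 means tau0 was preferred.
    """
    n, d = phi_diffs.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(phi_diffs @ theta)           # predicted P(tau0 preferred)
        grad = phi_diffs.T @ (labels - p) / n    # gradient of avg log-likelihood
        theta += lr * grad
    return theta

# Toy usage: recover a hidden theta_star from simulated preference labels.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -0.5, 0.25])
phi_diffs = rng.normal(size=(500, 3))            # feature differences of trajectory pairs
labels = (rng.random(500) < sigmoid(phi_diffs @ theta_star)).astype(float)
print("estimated theta:", fit_reward_mle(phi_diffs, labels))
```

With enough preference queries the estimate approaches the hidden parameter; the number of queries needed for this step is the kind of feedback complexity the paper's bounds control.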
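The algorithm names in the Pseudocode row suggest a two-phase structure: reward-agnostic exploration first (REGIME-exploration), then experimental design over trajectory pairs, human preference queries, reward estimation, and planning (REGIME-planning). The toy skeleton below, which reuses fit_reward_mle from the previous sketch, shows that control flow under strongly simplified assumptions (a bandit-like feature model, random exploration, random pair selection, and a simulated labeler); it is not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 3                                      # feature dimension (toy choice)
THETA_STAR = np.array([1.0, -0.5, 0.25])   # hidden reward parameter, unseen by the learner

def explore(num_episodes):
    # Phase 1 (cf. REGIME-exploration): gather trajectory features with
    # no reward signal; here each "trajectory" is a random feature vector.
    return rng.normal(size=(num_episodes, D))

def select_pairs(features, num_queries):
    # Phase 2 (cf. Algorithm 1): experimental design over trajectory
    # pairs; random pairing stands in for the paper's design criterion.
    idx = rng.integers(0, len(features), size=(num_queries, 2))
    return features[idx[:, 0]] - features[idx[:, 1]]

def query_labels(phi_diffs):
    # Simulated Bradley-Terry labeler standing in for human feedback.
    p = 1.0 / (1.0 + np.exp(-phi_diffs @ THETA_STAR))
    return (rng.random(len(phi_diffs)) < p).astype(float)

def plan(features, theta_hat):
    # Phase 3 (cf. REGIME-planning): pick the trajectory with the
    # highest estimated reward.
    return features[np.argmax(features @ theta_hat)]

features = explore(500)
phi_diffs = select_pairs(features, 1000)
labels = query_labels(phi_diffs)
theta_hat = fit_reward_mle(phi_diffs, labels)  # estimator from the previous sketch
print("selected trajectory features:", plan(features, theta_hat))
```

Note the ordering, which is the paper's key design point: all environment interaction in explore happens before any preference query is issued, so human feedback is spent only on reward identification.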