Provable Reward-Agnostic Preference-Based Reinforcement Learning
Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this study, we fill in such a gap between theoretical PbRL and practical algorithms by proposing a theoretical reward-agnostic PbRL framework in which exploratory trajectories that enable accurate learning of the hidden reward functions are acquired before collecting any human feedback. Theoretical analysis demonstrates that our algorithm requires less human feedback for learning the optimal policy under preference-based models with linear parameterization and unknown transitions, compared to the existing theoretical literature. |
| Researcher Affiliation | Collaboration | Wenhao Zhan, Princeton University (wenhao.zhan@princeton.edu); Masatoshi Uehara, Genentech (uehara.masatoshi@gene.com); Wen Sun, Cornell University (ws455@cornell.edu); Jason D. Lee, Princeton University (jasonlee@princeton.edu) |
| Pseudocode | Yes | Algorithm 1 REGIME: Experimental Design for Querying Human Preference; Algorithm 2 REGIME-lin; Algorithm 3 REGIME-action; Algorithm 4 REGIME-exploration; Algorithm 5 REGIME-planning |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating the release of open-source code for the described methodology. |
| Open Datasets | No | This paper is theoretical, focusing on algorithmic design and theoretical analysis of sample complexity. It does not conduct empirical experiments on specific datasets, and therefore provides no access information for a publicly available or open dataset for training purposes. |
| Dataset Splits | No | This paper is theoretical, focusing on algorithmic design and theoretical analysis rather than empirical experiments. As such, it does not provide specific dataset split information for validation. |
| Hardware Specification | No | This paper is theoretical and focuses on algorithmic design and analysis. It does not conduct empirical experiments, and therefore, no specific hardware details used for running experiments are mentioned. |
| Software Dependencies | No | This paper is theoretical and focuses on algorithmic design and analysis. It does not conduct empirical experiments, and therefore, no specific ancillary software details with version numbers are provided. |
| Experiment Setup | No | This paper is theoretical and focuses on algorithmic design and analysis. It does not conduct empirical experiments, and therefore, no specific experimental setup details such as hyperparameters or training configurations are provided. |
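The linear preference-based model the paper analyzes can be illustrated with a small sketch. This is a generic Bradley-Terry logistic-regression setup over trajectory features, not the paper's REGIME algorithm; all variable names and the synthetic data are hypothetical, chosen only to show how a hidden linear reward can be recovered from pairwise human-style preferences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each trajectory tau is summarized by a feature
# vector phi(tau) in R^d, and the hidden reward is linear in those
# features: r(tau) = <theta*, phi(tau)>.
d, n_pairs = 5, 2000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

# Feature vectors for pairs of trajectories (stand-ins for exploratory
# trajectories collected before any human feedback is queried).
phi_a = rng.normal(size=(n_pairs, d))
phi_b = rng.normal(size=(n_pairs, d))

# Bradley-Terry preference model:
#   P(a preferred over b) = sigmoid(<theta*, phi(a) - phi(b)>).
diff = phi_a - phi_b
prob_a = 1.0 / (1.0 + np.exp(-diff @ theta_star))
labels = (rng.random(n_pairs) < prob_a).astype(float)  # 1 if a preferred

# Maximum-likelihood estimate of theta via plain gradient descent on the
# logistic-regression loss over the preference labels.
theta = np.zeros(d)
lr = 0.5
for _ in range(500):
    pred = 1.0 / (1.0 + np.exp(-diff @ theta))
    grad = diff.T @ (pred - labels) / n_pairs
    theta -= lr * grad

theta_hat = theta / np.linalg.norm(theta)
print("cosine similarity to true reward direction:",
      float(theta_hat @ theta_star))
```

With enough preference pairs the estimated direction aligns closely with the true reward direction; the paper's contribution is choosing which trajectory pairs to query so that this alignment is reached with as little human feedback as possible.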