Provable Reward-Agnostic Preference-Based Reinforcement Learning

Authors: Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this study, we fill in such a gap between theoretical PbRL and practical algorithms by proposing a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired before collecting any human feedback. Theoretical analysis demonstrates that our algorithm requires less human feedback for learning the optimal policy under preference-based models with linear parameterization and unknown transitions, compared to the existing theoretical literature. (A minimal sketch of this preference model appears after the table.)
Researcher Affiliation | Collaboration | Wenhao Zhan (Princeton University, wenhao.zhan@princeton.edu); Masatoshi Uehara (Genentech, uehara.masatoshi@gene.com); Wen Sun (Cornell University, ws455@cornell.edu); Jason D. Lee (Princeton University, jasonlee@princeton.edu)
Pseudocode | Yes | Algorithm 1 REGIME: Experimental Design for Querying Human Preference; Algorithm 2 REGIME-lin; Algorithm 3 REGIME-action; Algorithm 4 REGIME-exploration; Algorithm 5 REGIME-planning. (A toy skeleton of this pipeline appears after the table.)
Open Source Code | No | The paper does not provide an explicit statement or link indicating the release of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical, focusing on algorithmic design and theoretical analysis of sample complexity. It conducts no empirical experiments on specific datasets and therefore provides no access information for a publicly available or open training dataset.
Dataset Splits | No | The paper is theoretical, focusing on algorithmic design and theoretical analysis rather than empirical experiments, so it provides no dataset split information for validation.
Hardware Specification | No | The paper is theoretical and conducts no empirical experiments, so no hardware details are mentioned.
Software Dependencies | No | The paper is theoretical and conducts no empirical experiments, so no ancillary software dependencies with version numbers are listed.
Experiment Setup | No | The paper is theoretical and conducts no empirical experiments, so no experimental setup details such as hyperparameters or training configurations are provided.
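For context on the setting described in the Research Type row: the paper studies preference-based RL where the hidden reward is linear in known trajectory features and human feedback on a trajectory pair follows a Bradley-Terry-style comparison model. Below is a minimal sketch, assuming that setup, of how such a reward parameter could be estimated from preference labels by logistic maximum likelihood. The function names, the gradient-ascent estimator, and the toy data are illustrative assumptions, not code from the paper.

```python
import numpy as np

# Hedged sketch: linear reward r(tau) = <theta, phi(tau)> with a
# Bradley-Terry preference model P(tau0 preferred) = sigmoid(<theta, phi0 - phi1>).
# Plain gradient ascent on the preference log-likelihood is an
# illustrative stand-in for the paper's MLE step.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_reward_mle(phi_diffs, labels, lr=0.1, iters=2000):
    """Logistic MLE for the hidden reward parameter theta.

    phi_diffs: (n, d) array, phi(tau0_i) - phi(tau1_i) for each queried pair.
    labels:    (n,) array in {0, 1}; 1 means tau0 was preferred.
    """
    n, d = phi_diffs.shape
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(phi_diffs @ theta)           # predicted P(tau0 preferred)
        grad = phi_diffs.T @ (labels - p) / n    # gradient of avg log-likelihood
        theta += lr * grad
    return theta

# Toy usage: recover a hidden theta_star from simulated preference labels.
rng = np.random.default_rng(0)
theta_star = np.array([1.0, -0.5, 0.25])
phi_diffs = rng.normal(size=(500, 3))            # feature differences of trajectory pairs
labels = (rng.random(500) < sigmoid(phi_diffs @ theta_star)).astype(float)
print("estimated theta:", fit_reward_mle(phi_diffs, labels))
```

With enough preference queries the estimate approaches the hidden parameter; the number of queries needed for this step is the kind of feedback complexity the paper's bounds control.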
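The algorithm names in the Pseudocode row suggest a two-phase structure: reward-agnostic exploration first (REGIME-exploration), then experimental design over trajectory pairs, human preference queries, reward estimation, and planning (REGIME-planning). The toy skeleton below, which reuses fit_reward_mle from the previous sketch, shows that control flow under strongly simplified assumptions (a bandit-like feature model, random exploration, random pair selection, and a simulated labeler); it is not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

D = 3                                      # feature dimension (toy choice)
THETA_STAR = np.array([1.0, -0.5, 0.25])   # hidden reward parameter, unseen by the learner

def explore(num_episodes):
    # Phase 1 (cf. REGIME-exploration): gather trajectory features with
    # no reward signal; here each "trajectory" is a random feature vector.
    return rng.normal(size=(num_episodes, D))

def select_pairs(features, num_queries):
    # Phase 2 (cf. Algorithm 1): experimental design over trajectory
    # pairs; random pairing stands in for the paper's design criterion.
    idx = rng.integers(0, len(features), size=(num_queries, 2))
    return features[idx[:, 0]] - features[idx[:, 1]]

def query_labels(phi_diffs):
    # Simulated Bradley-Terry labeler standing in for human feedback.
    p = 1.0 / (1.0 + np.exp(-phi_diffs @ THETA_STAR))
    return (rng.random(len(phi_diffs)) < p).astype(float)

def plan(features, theta_hat):
    # Phase 3 (cf. REGIME-planning): pick the trajectory with the
    # highest estimated reward.
    return features[np.argmax(features @ theta_hat)]

features = explore(500)
phi_diffs = select_pairs(features, 1000)
labels = query_labels(phi_diffs)
theta_hat = fit_reward_mle(phi_diffs, labels)  # estimator from the previous sketch
print("selected trajectory features:", plan(features, theta_hat))
```

Note the ordering, which is the paper's key design point: all environment interaction in explore happens before any preference query is issued, so human feedback is spent only on reward identification.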