Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation
Authors: Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, Liwei Wang
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We prove that our algorithm achieves the regret bound of $\tilde{O}(\mathrm{poly}(dH)\sqrt{K})$, where d is a complexity measure of the transition and preference models depending on the Eluder dimension and log-covering numbers, H is the planning horizon, K is the number of episodes, and $\tilde{O}(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with n-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation. (The regret bound is restated in LaTeX below the table.) |
| Researcher Affiliation | Academia | (1) Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University; (2) Center for Data Science, Peking University; (3) Peng Cheng Laboratory; (4) Department of Statistics and Data Science, Yale University; (5) Department of Industrial Engineering and Management Sciences, Northwestern University. |
| Pseudocode | Yes | Algorithm 1 PbOP: Preference-based Optimistic Planning; Algorithm 2 Reduction Protocol; Algorithm 3 PbOP+: Pairwise Preference-based Optimistic Planning; Algorithm 4 RL with Trajectory Feedback. (A toy sketch of the pairwise trajectory-preference feedback these algorithms consume appears below the table.) |
| Open Source Code | No | The paper does not include any explicit statements about making source code available or links to code repositories. |
| Open Datasets | No | This paper is theoretical and focuses on proving regret bounds and developing algorithms for reinforcement learning with general function approximation. It does not describe any experiments involving specific datasets, training processes, or data splits. |
| Dataset Splits | No | This paper is theoretical and does not describe empirical experiments involving dataset splits for validation. The focus is on algorithm design and theoretical guarantees. |
| Hardware Specification | No | The paper is theoretical and describes algorithms and proofs. It does not include any information about hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and focuses on algorithm design and proofs. It does not mention any specific software dependencies or versions for experimental setup. |
| Experiment Setup | No | The paper is theoretical and describes algorithms and proofs. It does not include details on experimental setup, hyperparameters, or training configurations. |
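
The regret bound quoted in the Research Type row, restated in LaTeX for readability. The symbols follow the quoted description; the exact polynomial dependence on d and H is left as in the paper's statement.

```latex
% Regret bound quoted in the Research Type row.
% d : complexity measure of the transition and preference models
%     (via the Eluder dimension and log-covering numbers)
% H : planning horizon,  K : number of episodes
% \tilde{O}(\cdot) hides logarithmic factors.
\[
  \mathrm{Regret}(K) \;\le\; \tilde{O}\!\bigl(\operatorname{poly}(dH)\,\sqrt{K}\bigr).
\]
```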
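
Since no code is released, the following is a minimal, hypothetical Python sketch of the trajectory-level pairwise preference feedback that algorithms such as PbOP+ consume. It assumes a logistic (Bradley-Terry-style) link of return differences; the function names and the logistic choice are illustrative assumptions, not the paper's general preference model class.

```python
import numpy as np

# Toy sketch (not the paper's PbOP/PbOP+ algorithms): simulate trajectory-level
# pairwise preference feedback under a logistic link of return differences,
# a common special case of preference models in the PbRL literature.
# All names and the logistic choice are illustrative assumptions.

rng = np.random.default_rng(0)

def trajectory_return(traj, reward_fn):
    """Sum an (unobserved) per-step reward over a trajectory of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in traj)

def pairwise_preference(traj_1, traj_2, reward_fn):
    """Return 1 if traj_1 is preferred to traj_2, sampled from a logistic link
    of the return gap; the learner only ever observes this binary bit."""
    gap = trajectory_return(traj_1, reward_fn) - trajectory_return(traj_2, reward_fn)
    p_prefer_1 = 1.0 / (1.0 + np.exp(-gap))
    return int(rng.random() < p_prefer_1)

# Example: two length-3 trajectories in a toy environment where reward = state value.
reward_fn = lambda s, a: float(s)
traj_a = [(0.2, 0), (0.5, 1), (0.9, 0)]
traj_b = [(0.1, 1), (0.3, 0), (0.4, 1)]
print(pairwise_preference(traj_a, traj_b, reward_fn))  # prints 0 or 1
```

In the paper's extended setting of RL with n-wise comparisons, the oracle would compare n such trajectories at once rather than a single pair.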