Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation
Authors: Xiaoyu Chen, Han Zhong, Zhuoran Yang, Zhaoran Wang, Liwei Wang
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We prove that our algorithm achieves the regret bound of $\tilde{O}(\mathrm{poly}(dH)\sqrt{K})$, where d is a complexity measure of the transition and preference models depending on the Eluder dimension and log-covering numbers, H is the planning horizon, K is the number of episodes, and $\tilde{O}(\cdot)$ omits logarithmic terms. Our lower bound indicates that our algorithm is near-optimal when specialized to the linear setting. Furthermore, we extend the PbRL problem by formulating a novel problem called RL with n-wise comparisons, and provide the first sample-efficient algorithm for this new setting. To the best of our knowledge, this is the first theoretical result for PbRL with (general) function approximation. (The regret bound is restated in LaTeX below the table.) |
| Researcher Affiliation | Academia | (1) Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University; (2) Center for Data Science, Peking University; (3) Peng Cheng Laboratory; (4) Department of Statistics and Data Science, Yale University; (5) Department of Industrial Engineering and Management Sciences, Northwestern University. |
| Pseudocode | Yes | Algorithm 1 PbOP: Preference-based Optimistic Planning; Algorithm 2 Reduction Protocol; Algorithm 3 PbOP+: Pairwise Preference-based Optimistic Planning; Algorithm 4 RL with Trajectory Feedback. (A toy sketch of the pairwise trajectory-preference feedback these algorithms consume appears below the table.) |
| Open Source Code | No | The paper does not include any explicit statements about making source code available or links to code repositories. |
| Open Datasets | No | This paper is theoretical and focuses on proving regret bounds and developing algorithms for reinforcement learning with general function approximation. It does not describe any experiments involving specific datasets, training processes, or data splits. |
| Dataset Splits | No | This paper is theoretical and does not describe empirical experiments involving dataset splits for validation. The focus is on algorithm design and theoretical guarantees. |
| Hardware Specification | No | The paper is theoretical and describes algorithms and proofs. It does not include any information about hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and focuses on algorithm design and proofs. It does not mention any specific software dependencies or versions for experimental setup. |
| Experiment Setup | No | The paper is theoretical and describes algorithms and proofs. It does not include details on experimental setup, hyperparameters, or training configurations. |
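
The regret bound quoted in the Research Type row, restated in LaTeX for readability. The symbols follow the quoted description; the exact polynomial dependence on d and H is left as in the paper's statement.

```latex
% Regret bound quoted in the Research Type row.
% d : complexity measure of the transition and preference models
%     (via the Eluder dimension and log-covering numbers)
% H : planning horizon,  K : number of episodes
% \tilde{O}(\cdot) hides logarithmic factors.
\[
  \mathrm{Regret}(K) \;\le\; \tilde{O}\!\bigl(\operatorname{poly}(dH)\,\sqrt{K}\bigr).
\]
```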
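
Since no code is released, the following is a minimal, hypothetical Python sketch of the trajectory-level pairwise preference feedback that algorithms such as PbOP+ consume. It assumes a logistic (Bradley-Terry-style) link of return differences; the function names and the logistic choice are illustrative assumptions, not the paper's general preference model class.

```python
import numpy as np

# Toy sketch (not the paper's PbOP/PbOP+ algorithms): simulate trajectory-level
# pairwise preference feedback under a logistic link of return differences,
# a common special case of preference models in the PbRL literature.
# All names and the logistic choice are illustrative assumptions.

rng = np.random.default_rng(0)

def trajectory_return(traj, reward_fn):
    """Sum an (unobserved) per-step reward over a trajectory of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in traj)

def pairwise_preference(traj_1, traj_2, reward_fn):
    """Return 1 if traj_1 is preferred to traj_2, sampled from a logistic link
    of the return gap; the learner only ever observes this binary bit."""
    gap = trajectory_return(traj_1, reward_fn) - trajectory_return(traj_2, reward_fn)
    p_prefer_1 = 1.0 / (1.0 + np.exp(-gap))
    return int(rng.random() < p_prefer_1)

# Example: two length-3 trajectories in a toy environment where reward = state value.
reward_fn = lambda s, a: float(s)
traj_a = [(0.2, 0), (0.5, 1), (0.9, 0)]
traj_b = [(0.1, 1), (0.3, 0), (0.4, 1)]
print(pairwise_preference(traj_a, traj_b, reward_fn))  # prints 0 or 1
```

In the paper's extended setting of RL with n-wise comparisons, the oracle would compare n such trajectories at once rather than a single pair.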