Provable Offline Preference-Based Reinforcement Learning
Authors: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs. |
| Researcher Affiliation | Collaboration | Wenhao Zhan (Princeton University) wenhao.zhan@princeton.edu; Masatoshi Uehara (Genentech) uehara.masatoshi@gene.com; Nathan Kallus (Cornell University) kallus@cornell.edu; Jason D. Lee (Princeton University) jasonlee@princeton.edu; Wen Sun (Cornell University) ws455@cornell.edu |
| Pseudocode | Yes | Algorithm 1 FREEHAND: oFfline ReinforcemEnt lEarning with HumAN feeDback |
| Open Source Code | No | The paper does not contain any statements or links regarding the release or availability of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and focuses on algorithm design and analysis, rather than empirical evaluation. It does not refer to any specific publicly available datasets for training, validation, or testing. It mentions 'offline dataset D' in a theoretical context. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments or dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper is theoretical and does not describe any experiments or the hardware used to run them. |
| Software Dependencies | No | The paper is theoretical and does not describe software implementations or specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe any experimental setups, hyperparameters, or training configurations. |
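
The Research Type row quotes the paper's two-step recipe: (1) estimate the implicit reward by MLE from offline preference data and (2) solve a pessimistic (distributionally robust) planning problem over a confidence set around the MLE. The sketch below is a minimal, hypothetical illustration of that structure, not the authors' FREEHAND implementation: the linear Bradley-Terry reward model, the finite parameter grid standing in for general function approximation, the likelihood-slack confidence set, and the feature-summarized candidate policies are all simplifying assumptions introduced here.

```python
# Hypothetical sketch of the two-step structure described in the abstract:
# (1) Bradley-Terry MLE for the implicit reward, (2) pessimistic policy
# selection over a likelihood-based confidence set. All model and variable
# choices below are illustrative assumptions, not the paper's algorithm.
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic offline preference data -------------------------------------
# Each sample: features of two trajectories (phi0, phi1) and a label y = True
# if trajectory 1 was preferred; preferences follow a Bradley-Terry model.
d, n = 3, 500
theta_true = np.array([1.0, -0.5, 0.3])
phi0 = rng.normal(size=(n, d))
phi1 = rng.normal(size=(n, d))
logits = phi1 @ theta_true - phi0 @ theta_true
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logits))

def neg_log_lik(theta):
    """Bradley-Terry negative log-likelihood of the preference labels."""
    z = (phi1 - phi0) @ theta
    return np.mean(np.log1p(np.exp(-z)) * y + np.log1p(np.exp(z)) * (~y))

# --- Step 1: MLE over a finite grid of reward parameters (for simplicity) ---
grid = [rng.normal(size=d) for _ in range(2000)] + [theta_true]
nll = np.array([neg_log_lik(th) for th in grid])
theta_mle = grid[int(np.argmin(nll))]

# Confidence set: parameters whose likelihood is within a slack of the MLE's.
slack = 0.01
conf_set = [th for th, v in zip(grid, nll) if v <= nll.min() + slack]

# --- Step 2: pessimistic policy selection over the confidence set -----------
# Candidate policies are summarized by the expected trajectory feature they
# induce (an assumption made purely to keep the sketch self-contained).
policy_features = {
    "pi_a": np.array([0.8, 0.1, 0.0]),
    "pi_b": np.array([0.2, -0.9, 0.5]),
}

def pessimistic_value(feat):
    # Worst-case expected reward over the confidence set around the MLE.
    return min(float(feat @ th) for th in conf_set)

best_policy = max(policy_features, key=lambda k: pessimistic_value(policy_features[k]))
print("MLE estimate:", np.round(theta_mle, 2))
print("Selected policy:", best_policy)
```

The pessimistic step mirrors the single-policy-coverage intuition in the abstract: a policy is only selected if its value holds up under every reward consistent with the offline preference data, so policies poorly covered by the data cannot look spuriously good.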