Provable Offline Preference-Based Reinforcement Learning

Authors: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs. (A toy sketch of these two steps appears after this table.)
Researcher Affiliation | Collaboration | Wenhao Zhan (Princeton University, wenhao.zhan@princeton.edu); Masatoshi Uehara (Genentech, uehara.masatoshi@gene.com); Nathan Kallus (Cornell University, kallus@cornell.edu); Jason D. Lee (Princeton University, jasonlee@princeton.edu); Wen Sun (Cornell University, ws455@cornell.edu)
Pseudocode | Yes | Algorithm 1 FREEHAND: oFfline ReinforcemEnt lEarning with HumAN feeDback
Open Source Code | No | The paper does not contain any statements or links regarding the release or availability of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and focuses on algorithm design and analysis rather than empirical evaluation. It does not refer to any specific publicly available datasets for training, validation, or testing; it mentions an 'offline dataset D' only in a theoretical context.
Dataset Splits | No | The paper is theoretical and does not describe empirical experiments or dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any experiments or the hardware used to run them.
Software Dependencies | No | The paper is theoretical and does not describe software implementations or specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe any experimental setups, hyperparameters, or training configurations.
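The two-step procedure quoted in the Research Type row (MLE estimation of the implicit reward, then distributionally robust planning over a confidence set around the MLE) can be made concrete with a small numerical sketch. The code below is not the paper's FREEHAND algorithm; it is a minimal toy illustration that assumes a Bradley-Terry preference model over linear trajectory-level rewards, a finite reward class, and candidate policies summarized by the mean trajectory features they induce. All names (reward_class, policy_feats, slack) are hypothetical, and the confidence-set slack is ad hoc rather than derived from the paper's concentration bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite class of candidate trajectory-level reward functions: each maps a
# trajectory feature vector to a scalar reward via a fixed parameter vector.
d = 3
reward_class = [rng.normal(size=d) for _ in range(50)]

def reward(theta, traj_feat):
    return float(theta @ traj_feat)

def log_likelihood(theta, prefs):
    """Bradley-Terry log-likelihood of observed trajectory preferences.

    prefs: list of (features_of_preferred, features_of_rejected) pairs.
    """
    ll = 0.0
    for f_win, f_lose in prefs:
        margin = reward(theta, f_win) - reward(theta, f_lose)
        ll += -np.log1p(np.exp(-margin))  # log sigmoid(margin)
    return ll

# Toy offline preference dataset generated from a hidden "true" reward.
theta_star = rng.normal(size=d)
prefs = []
for _ in range(200):
    f1, f2 = rng.normal(size=d), rng.normal(size=d)
    p_prefer_f1 = 1.0 / (1.0 + np.exp(reward(theta_star, f2) - reward(theta_star, f1)))
    prefs.append((f1, f2) if rng.random() < p_prefer_f1 else (f2, f1))

# Step 1: maximum likelihood estimation over the reward class.
lls = np.array([log_likelihood(theta, prefs) for theta in reward_class])
theta_mle = reward_class[int(np.argmax(lls))]
ll_mle = lls.max()
print("MLE reward parameters:", theta_mle)

# Confidence set: rewards whose log-likelihood is within a slack of the MLE's.
slack = 5.0  # placeholder; the paper sets this via a concentration argument
confidence_set = [theta for theta, ll in zip(reward_class, lls) if ll >= ll_mle - slack]

# Step 2: pessimistic (distributionally robust) planning. Each candidate policy
# is summarized by the mean trajectory features it induces; choose the policy
# maximizing its worst-case value over the confidence set.
policy_feats = [rng.normal(size=d) for _ in range(10)]

def pessimistic_value(feat):
    return min(reward(theta, feat) for theta in confidence_set)

best_policy = max(range(len(policy_feats)),
                  key=lambda i: pessimistic_value(policy_feats[i]))
print("chosen policy index:", best_policy)
```

The pessimistic step is the crux of the sketch: a policy is scored by its worst-case value over every reward that explains the preference data nearly as well as the MLE, which is the mechanism the abstract ties to coverage of the target policy by the offline data (single-policy concentrability).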