Provable Offline Preference-Based Reinforcement Learning

Authors: Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE. We consider the general reward setting where the reward can be defined over the whole trajectory and provide a novel guarantee that allows us to learn any target policy with a polynomial number of samples, as long as the target policy is covered by the offline data. This guarantee is the first of its kind with general function approximation. To measure the coverage of the target policy, we introduce a new single-policy concentrability coefficient, which can be upper bounded by the per-trajectory concentrability coefficient. We also establish lower bounds that highlight the necessity of such concentrability and the difference from standard RL, where state-action-wise rewards are directly observed. We further extend and analyze our algorithm when the feedback is given over action pairs. (A toy sketch of these two steps appears after this table.)
Researcher Affiliation | Collaboration | Wenhao Zhan (Princeton University, wenhao.zhan@princeton.edu); Masatoshi Uehara (Genentech, uehara.masatoshi@gene.com); Nathan Kallus (Cornell University, kallus@cornell.edu); Jason D. Lee (Princeton University, jasonlee@princeton.edu); Wen Sun (Cornell University, ws455@cornell.edu)
Pseudocode | Yes | Algorithm 1 FREEHAND: oFfline ReinforcemEnt lEarning with HumAN feeDback
Open Source Code | No | The paper does not contain any statements or links regarding the release or availability of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and focuses on algorithm design and analysis rather than empirical evaluation. It does not refer to any specific publicly available datasets for training, validation, or testing; it mentions an 'offline dataset D' only in a theoretical context.
Dataset Splits | No | The paper is theoretical and does not describe empirical experiments or dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any experiments or the hardware used to run them.
Software Dependencies | No | The paper is theoretical and does not describe software implementations or specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not describe any experimental setups, hyperparameters, or training configurations.
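The two-step procedure quoted in the Research Type row (MLE estimation of the implicit reward, then distributionally robust planning over a confidence set around the MLE) can be made concrete with a small numerical sketch. The code below is not the paper's FREEHAND algorithm; it is a minimal toy illustration that assumes a Bradley-Terry preference model over linear trajectory-level rewards, a finite reward class, and candidate policies summarized by the mean trajectory features they induce. All names (reward_class, policy_feats, slack) are hypothetical, and the confidence-set slack is ad hoc rather than derived from the paper's concentration bound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite class of candidate trajectory-level reward functions: each maps a
# trajectory feature vector to a scalar reward via a fixed parameter vector.
d = 3
reward_class = [rng.normal(size=d) for _ in range(50)]

def reward(theta, traj_feat):
    return float(theta @ traj_feat)

def log_likelihood(theta, prefs):
    """Bradley-Terry log-likelihood of observed trajectory preferences.

    prefs: list of (features_of_preferred, features_of_rejected) pairs.
    """
    ll = 0.0
    for f_win, f_lose in prefs:
        margin = reward(theta, f_win) - reward(theta, f_lose)
        ll += -np.log1p(np.exp(-margin))  # log sigmoid(margin)
    return ll

# Toy offline preference dataset generated from a hidden "true" reward.
theta_star = rng.normal(size=d)
prefs = []
for _ in range(200):
    f1, f2 = rng.normal(size=d), rng.normal(size=d)
    p_prefer_f1 = 1.0 / (1.0 + np.exp(reward(theta_star, f2) - reward(theta_star, f1)))
    prefs.append((f1, f2) if rng.random() < p_prefer_f1 else (f2, f1))

# Step 1: maximum likelihood estimation over the reward class.
lls = np.array([log_likelihood(theta, prefs) for theta in reward_class])
theta_mle = reward_class[int(np.argmax(lls))]
ll_mle = lls.max()
print("MLE reward parameters:", theta_mle)

# Confidence set: rewards whose log-likelihood is within a slack of the MLE's.
slack = 5.0  # placeholder; the paper sets this via a concentration argument
confidence_set = [theta for theta, ll in zip(reward_class, lls) if ll >= ll_mle - slack]

# Step 2: pessimistic (distributionally robust) planning. Each candidate policy
# is summarized by the mean trajectory features it induces; choose the policy
# maximizing its worst-case value over the confidence set.
policy_feats = [rng.normal(size=d) for _ in range(10)]

def pessimistic_value(feat):
    return min(reward(theta, feat) for theta in confidence_set)

best_policy = max(range(len(policy_feats)),
                  key=lambda i: pessimistic_value(policy_feats[i]))
print("chosen policy index:", best_policy)
```

The pessimistic step is the crux of the sketch: a policy is scored by its worst-case value over every reward that explains the preference data nearly as well as the MLE, which is the mechanism the abstract ties to coverage of the target policy by the offline data (single-policy concentrability).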