Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian

Authors: Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart Russell, Jiantao Jiao

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | "In this paper, we leverage the marginalized importance sampling (MIS) formulation of RL and present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability, bypassing the need for uncertainty quantification. We conduct theoretical investigations and design algorithms starting from multi-armed bandits (MABs), going forward to contextual bandits (CBs), and finally Markov decision processes (MDPs)."
Researcher Affiliation | Academia | Paria Rashidinejad, Hanlin Zhu, Kunhe Yang, Stuart Russell, Jiantao Jiao; Department of Electrical Engineering and Computer Sciences, Department of Statistics, University of California, Berkeley; {paria.rashidinejad,hanlinzhu,kunheyang,russell,jiantao}@berkeley.edu
Pseudocode | Yes | Algorithm 1: ALM with MIS (ALMIS) for offline MAB; Algorithm 2: ALM with MIS (ALMIS) for offline CB; Algorithm 3: ALM with MIS (ALMIS) for offline RL, model-based; Algorithm 4: ALM with MIS (ALMIS) for offline RL, model-free
Open Source Code | No | The paper makes no explicit statement about releasing source code and provides no link to a code repository for the described methodology.
Open Datasets | No | The paper mentions a "previously-collected offline dataset $D = \{(s_i, a_i, r_i, s'_i)\}_{i=1}^{N}$" and a dataset $D_0 = \{s_i\}_{i=1}^{N}$ for MDPs, but it names no publicly available dataset (e.g., CIFAR-10, ImageNet) and provides no link or citation for a dataset used in an empirical evaluation.
Dataset Splits | No | The paper focuses on theoretical analysis and algorithm design with proofs; it includes no empirical experiments with explicit training, validation, or test splits.
Hardware Specification | No | The paper is theoretical and does not describe any experimental hardware.
Software Dependencies | No | The paper is theoretical and does not list software dependencies or version numbers.
Experiment Setup | No | The paper is theoretical and focuses on algorithm design and proofs. It does not provide experimental setup details such as hyperparameters or system-level training settings.