Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage

Authors: Masatoshi Uehara, Wen Sun

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class and uses a constraint over the model class to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy that is covered by the offline data. We then demonstrate the flexibility of CPPO... Our theoretical results provide a sharp contrast between model-based and model-free approaches in offline RL.
Researcher Affiliation | Academia | Masatoshi Uehara, Wen Sun; Department of Computer Science, Cornell University, Ithaca, NY 14850, USA; {mu223,ws455}@cornell.edu
Pseudocode | Yes | Algorithm 1: Constrained Pessimistic Policy Optimization (CPPO). An illustrative sketch of the algorithm's max-min structure follows this table.
Open Source Code | No | The paper does not include any statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | No | The paper is theoretical and focuses on providing PAC guarantees and theoretical analysis under partial coverage. It mentions using an "offline dataset D" but does not describe using a specific, publicly available dataset with concrete access information for empirical training or evaluation.
Dataset Splits | No | The paper is theoretical and does not present empirical experiments, so there is no mention of dataset splits (training, validation, test) for reproducibility.
Hardware Specification | No | The paper is theoretical and does not describe running empirical experiments, so no hardware specifications are provided.
Software Dependencies | No | The paper is theoretical and does not describe running empirical experiments, so no specific software dependencies with version numbers are provided.
Experiment Setup | No | The paper is theoretical and focuses on algorithm design and theoretical guarantees. It does not describe an empirical experimental setup, hyperparameters, or system-level training settings.
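
To make the constraint-based pessimism referenced in the Pseudocode row concrete, here is a minimal, hypothetical Python sketch of CPPO's max-min structure as described in the abstract: restrict attention to models that fit the offline data well, then return the policy with the best worst-case value over that restricted set. The likelihood-based constraint and every helper here (`log_likelihood`, `policy_value`, the finite tabular model class) are illustrative assumptions, not the paper's actual instantiation or its guarantees.

```python
# Hypothetical sketch of CPPO's constrained max-min structure, based only on
# the abstract: pessimism is encoded as a constraint over the model class,
# and the returned policy competes against the worst model in that set.
import numpy as np

def log_likelihood(model, dataset):
    """Empirical log-likelihood of observed transitions (s, a, r, s') under
    `model`, a tabular transition tensor P[s, a, s']."""
    return sum(np.log(model[s, a, s_next] + 1e-12)
               for (s, a, _r, s_next) in dataset)

def policy_value(model, reward, policy, gamma=0.99, horizon=200, s0=0):
    """Finite-horizon estimate of the discounted value of `policy` in `model`
    from initial state s0. `policy` is a stochastic table pi[s, a]."""
    n_states, n_actions, _ = model.shape
    dist = np.zeros(n_states)
    dist[s0] = 1.0
    value = 0.0
    for t in range(horizon):
        # Expected one-step reward under the current state distribution.
        value += gamma**t * sum(dist[s] * policy[s, a] * reward[s, a]
                                for s in range(n_states)
                                for a in range(n_actions))
        # Push the state distribution one step forward through the model.
        dist = np.einsum("s,sa,sap->p", dist, policy, model)
    return value

def cppo(model_class, reward, policies, dataset, slack):
    """Illustrative CPPO:
    1. keep models whose data likelihood is within `slack` of the best fit
       (under realizability, the ground-truth model should survive this cut);
    2. return the policy maximizing the worst-case value over that set."""
    best_ll = max(log_likelihood(M, dataset) for M in model_class)
    version_set = [M for M in model_class
                   if log_likelihood(M, dataset) >= best_ll - slack]
    return max(policies,
               key=lambda pi: min(policy_value(M, reward, pi)
                                  for M in version_set))
```

In the paper's setting, the inner minimization runs over a general function class rather than a finite enumeration, and the constraint slack is chosen so that realizability keeps the ground-truth model inside the constrained set with high probability; the finite lists of models and policies above are purely for illustration.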