Pessimistic Model-based Offline Reinforcement Learning under Partial Coverage
Authors: Masatoshi Uehara, Wen Sun
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We present an algorithm named Constrained Pessimistic Policy Optimization (CPPO) which leverages a general function class and uses a constraint over the model class to encode pessimism. Under the assumption that the ground truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data only providing partial coverage, i.e., it can learn a policy that competes against any policy that is covered by the offline data. We then demonstrate the flexibility of CPPO... Our theoretical results provide a sharp contrast between model-based and model-free approaches in offline RL. |
| Researcher Affiliation | Academia | Masatoshi Uehara, Wen Sun; Department of Computer Science, Cornell University, Ithaca, NY 14850, USA; {mu223,ws455}@cornell.edu |
| Pseudocode | Yes | Algorithm 1 Constrained Pessimistic Policy Optimization (CPPO) |
| Open Source Code | No | The paper does not include any statement about releasing source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | No | The paper is theoretical and focuses on providing PAC guarantees and theoretical analysis under partial coverage. It mentions using an "offline dataset D" but does not describe using a specific, publicly available dataset with concrete access information for empirical training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not present empirical experiments. Therefore, there is no mention of dataset splits (training, validation, test) for reproducibility. |
| Hardware Specification | No | The paper is theoretical and does not describe running empirical experiments. Therefore, no hardware specifications are provided. |
| Software Dependencies | No | The paper is theoretical and does not describe running empirical experiments. Therefore, no specific software dependencies with version numbers are provided. |
| Experiment Setup | No | The paper is theoretical and focuses on algorithm design and theoretical guarantees. It does not describe an empirical experimental setup, hyperparameters, or system-level training settings. |
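
The Research Type and Pseudocode rows describe CPPO as encoding pessimism through a constraint over the model class. The following is a minimal sketch of that idea, not a reproduction of the paper's Algorithm 1. It assumes a toy one-step problem with a finite model class, and it assumes a log-likelihood threshold as the constraint, which is a stand-in chosen for illustration. All names (`log_likelihood`, `cppo_sketch`, `xi`) are hypothetical.

```python
import math

def log_likelihood(model, data):
    # model[a] is the assumed probability that action a yields reward 1.
    # data is a list of (action, reward) pairs with reward in {0, 1}.
    return sum(
        math.log(model[a] if r == 1 else 1.0 - model[a]) for a, r in data
    )

def cppo_sketch(models, actions, data, xi):
    # Assumed constraint: keep every model whose log-likelihood on the
    # offline data is within xi of the best-fitting model.
    scores = [log_likelihood(m, data) for m in models]
    best = max(scores)
    version_space = [m for m, s in zip(models, scores) if s >= best - xi]

    # Pessimism: score each action by its worst-case value over the
    # constrained model set, then pick the action with the best worst case.
    def pessimistic_value(a):
        return min(m[a] for m in version_space)

    return max(actions, key=pessimistic_value), version_space

# Offline data covers only action 0 (8 successes, 2 failures).
data = [(0, 1)] * 8 + [(0, 0)] * 2
models = [
    {0: 0.8, 1: 0.1},   # fits the data; pessimistic about action 1
    {0: 0.8, 1: 0.95},  # fits the data; optimistic about action 1
    {0: 0.2, 1: 0.95},  # fits the data poorly; excluded by the constraint
]
action, version_space = cppo_sketch(models, actions=[0, 1], data=data, xi=1.0)
```

In this toy run the data cannot distinguish the first two models, so both stay in the constrained set. Action 1 has worst-case value 0.1 and action 0 has worst-case value 0.8, so the sketch selects the covered action 0.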