Bellman-consistent Pessimism for Offline Reinforcement Learning
Authors: Tengyang Xie, Ching-An Cheng, Nan Jiang, Paul Mineiro, Alekh Agarwal
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. The approach uses the offline dataset to first compute a lower bound on the value of each policy π ∈ Π, and then returns the policy with the highest pessimistic value estimate. While this high-level template is at the heart of many recent approaches [e.g., Fujimoto et al., 2019; Kumar et al., 2019; Liu et al., 2020; Kidambi et al., 2020; Yu et al., 2020; Kumar et al., 2020], our main novelty is in the design and analysis of Bellman-consistent pessimism for general function approximation. As for limitations and future work, the sample complexity of our practical algorithm is worse than that of the information-theoretic approach, and it will be interesting to close this gap. Another future direction is to empirically evaluate PSPI on benchmarks and compare it to existing approaches. |
| Researcher Affiliation | Collaboration | Tengyang Xie, UIUC, tx10@illinois.edu; Ching-An Cheng, Microsoft Research, chinganc@microsoft.com; Nan Jiang, UIUC, nanjiang@illinois.edu; Paul Mineiro, Microsoft Research, pmineiro@microsoft.com; Alekh Agarwal, Google Research, alekhagarwal@google.com |
| Pseudocode | Yes | Algorithm 1 PSPI: Pessimistic Soft Policy Iteration |
| Open Source Code | No | The paper does not provide any links to open-source code for the described methodology, nor does it state that such code will be released or is available in supplementary materials. |
| Open Datasets | No | We assume the standard i.i.d. data generation protocol in our theoretical derivations, that the offline dataset D consists of n i.i.d. (s, a, r, s′) tuples generated as (s, a) ∼ µ, r = R(s, a), s′ ∼ P(·|s, a) for some data distribution µ. The paper discusses 'a pre-collected dataset' but does not name a specific public dataset or provide access details for any dataset used. |
| Dataset Splits | No | The paper is theoretical and does not describe empirical experiments or dataset usage with specific train/validation/test splits. |
| Hardware Specification | No | The paper is theoretical and does not report on empirical experiments, thus no hardware specifications for running experiments are provided. |
| Software Dependencies | No | The paper is theoretical and does not report on empirical experiments; therefore, no specific software dependencies with version numbers are mentioned. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or system-level training settings. |
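The pessimism template quoted above (lower-bound each candidate policy's value from the offline data, then return the policy with the highest pessimistic estimate) can be sketched in a few lines. This is a toy illustration, not the paper's Bellman-consistent construction: the lower bound here is a generic mean-minus-uncertainty penalty over sampled returns, and all names (`pessimistic_value`, `select_policy`, the `beta` scale) are hypothetical.

```python
import math

def pessimistic_value(returns, beta=1.0):
    """Toy lower-confidence-bound estimate of a policy's value.

    Stand-in for the paper's Bellman-consistent lower bound: sample mean
    of observed returns minus an uncertainty penalty that shrinks as the
    amount of supporting data grows.
    """
    n = len(returns)
    mean = sum(returns) / n
    return mean - beta / math.sqrt(n)  # penalty ~ 1/sqrt(n)

def select_policy(policy_returns, beta=1.0):
    """Return the policy whose pessimistic value estimate is highest."""
    return max(policy_returns,
               key=lambda pi: pessimistic_value(policy_returns[pi], beta))

# Policy "a" looks better on its face (higher mean return) but is backed by
# a single sample; pessimism discounts it in favor of well-supported "b".
data = {"a": [1.2], "b": [1.0, 1.0, 1.0, 1.0]}
print(select_policy(data))  # prints "b"
```

The key behavior the sketch captures is that pessimism penalizes policies whose value estimates rest on little data, which is why bonus-free, Bellman-consistent constructions of the lower bound are the paper's actual contribution.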