The Importance of Pessimism in Fixed-Dataset Policy Optimization
Authors: Jacob Buckman, Carles Gelada, Marc G. Bellemare
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | These theoretical findings are validated by experiments on a tabular gridworld, and deep learning experiments on four MinAtar environments. |
| Researcher Affiliation | Not specified | Anonymous authors. Paper under double-blind review. |
| Pseudocode | Yes | Appendix D, Algorithms. Algorithm 1: Tabular Fixed-Dataset Policy Evaluation. Input: dataset $D$, policy $\pi$, discount $\gamma$. Construct $r_D, P_D$ as described in Section 2; $v \leftarrow (I - \gamma A^\pi P_D)^{-1} A^\pi r_D$; return $v$. (A minimal code sketch follows the table.) |
| Open Source Code | Yes | For an open-source implementation, including full details suitable for replication, please refer to the code in the accompanying GitHub repository: github.com/anonymized |
| Open Datasets | Yes | The second setting we evaluate on consists of four environments from the MinAtar suite (Young & Tian, 2019). |
| Dataset Splits | No | The paper mentions dataset sizes and how data is collected, but it does not specify explicit train/validation/test splits for the datasets. |
| Hardware Specification | No | The paper does not specify the hardware used to run its experiments. |
| Software Dependencies | No | The paper does not list the software frameworks or dependency versions needed to reproduce the experiments. |
| Experiment Setup | Yes | For both pessimistic algorithms, we absorb all constants into the hyperparameter α, which we selected to be α = 1 for both algorithms by a simple manual search. All experiments used identical hyperparameters. Hyperparameter tuning was done on just two experimental setups: BREAKOUT using ϵ = 0, and BREAKOUT using ϵ = 1. Tuning was very minimal, and done via a small manual search. In our experiments, approximately 250,000 gradient steps per target update were required to consistently minimize error enough to avoid divergence. |
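
Algorithm 1, quoted in the Pseudocode row above, is a single closed-form matrix computation. The snippet below is a minimal sketch of that tabular fixed-dataset policy evaluation step, assuming a dataset of `(s, a, r, s')` transitions over discrete state and action spaces. The function name, dataset format, and the handling of unvisited state-action pairs are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of tabular fixed-dataset policy evaluation (Algorithm 1):
# v = (I - gamma * A_pi @ P_D)^{-1} @ A_pi @ r_D, built from empirical estimates.
import numpy as np

def evaluate_policy(dataset, pi, gamma, n_states, n_actions):
    """Estimate state values for policy pi from a fixed dataset.

    dataset: iterable of (s, a, r, s_next) tuples with integer s, a, s_next.
    pi:      array of shape (n_states, n_actions), pi[s, a] = pi(a | s).
    """
    n_sa = n_states * n_actions
    counts = np.zeros(n_sa)                    # visit counts per (s, a)
    reward_sum = np.zeros(n_sa)                # summed rewards per (s, a)
    next_counts = np.zeros((n_sa, n_states))   # transition counts to s'

    for s, a, r, s_next in dataset:
        idx = s * n_actions + a
        counts[idx] += 1
        reward_sum[idx] += r
        next_counts[idx, s_next] += 1

    # Empirical reward vector r_D and transition matrix P_D.
    # Unvisited pairs get zero reward and a uniform next-state row here;
    # the paper's treatment of unvisited pairs may differ (assumption).
    safe = np.maximum(counts, 1)
    r_D = reward_sum / safe
    P_D = next_counts / safe[:, None]
    P_D[counts == 0] = 1.0 / n_states

    # A_pi maps state-action values to state values by averaging over pi(a | s).
    A_pi = np.zeros((n_states, n_sa))
    for s in range(n_states):
        A_pi[s, s * n_actions:(s + 1) * n_actions] = pi[s]

    # Solve (I - gamma * A_pi P_D) v = A_pi r_D for the value vector v.
    v = np.linalg.solve(np.eye(n_states) - gamma * A_pi @ P_D, A_pi @ r_D)
    return v
```

For any discount $\gamma < 1$, $A^\pi P_D$ is a stochastic matrix, so $I - \gamma A^\pi P_D$ is invertible and the linear solve is well defined.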