The Importance of Pessimism in Fixed-Dataset Policy Optimization

Authors: Jacob Buckman, Carles Gelada, Marc G. Bellemare

ICLR 2021

Reproducibility assessment. Each entry gives the variable, the extracted result, and the supporting LLM response:
Research Type: Experimental. "These theoretical findings are validated by experiments on a tabular gridworld, and deep learning experiments on four MinAtar environments."
Researcher Affiliation: Anonymous. "Anonymous authors. Paper under double-blind review."
Pseudocode: Yes. "Appendix D: Algorithms. Algorithm 1: Tabular Fixed-Dataset Policy Evaluation. Input: dataset D, policy π, discount γ. Construct r_D, P_D as described in Section 2; v ← (I - γ A^π P_D)^{-1} A^π r_D; return v." (A NumPy sketch of this algorithm appears below.)
Open Source Code: Yes. "For an open-source implementation, including full details suitable for replication, please refer to the code in the accompanying GitHub repository: github.com/anonymized"
Open Datasets: Yes. "The second setting we evaluate on consists of four environments from the MinAtar suite (Young & Tian, 2019)." (A sketch of collecting transitions from MinAtar also appears below.)
Dataset Splits: No. The paper mentions dataset sizes and how data is collected, but it does not specify explicit train/validation/test splits for the datasets.
Hardware Specification: No. The paper does not specify the hardware used for its experiments.
Software Dependencies: No. The paper does not list the software dependencies or library versions required to reproduce the experiments.
Experiment Setup: Yes. "For both pessimistic algorithms, we absorb all constants into the hyperparameter α, which we selected to be α = 1 for both algorithms by a simple manual search. All experiments used identical hyperparameters. Hyperparameter tuning was done on just two experimental setups: BREAKOUT using ε = 0, and BREAKOUT using ε = 1. Tuning was very minimal, and done via a small manual search. In our experiments, approximately 250,000 gradient steps per target update were required to consistently minimize error enough to avoid divergence." (These settings are collected into a sketch configuration below.)
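
To make the quoted Algorithm 1 concrete, here is a minimal NumPy sketch of tabular fixed-dataset policy evaluation. The function name tabular_fdpe, the (s, a, r, s') dataset format, and the zero-filling of unvisited state-action pairs are our own assumptions, not details from the paper; the final solve implements v = (I - γ A^π P_D)^{-1} A^π r_D as quoted above.

```python
import numpy as np

def tabular_fdpe(dataset, pi, gamma, n_states, n_actions):
    """Tabular fixed-dataset policy evaluation (sketch of Algorithm 1).

    dataset: iterable of (s, a, r, s_next) transitions, integer-indexed.
    pi: array of shape (n_states, n_actions) with pi[s, a] = pi(a | s).
    Returns v, the estimated value of each state under pi.
    """
    # Empirical model r_D, P_D, indexed by flattened state-action pairs.
    n_sa = n_states * n_actions
    counts = np.zeros(n_sa)
    r_D = np.zeros(n_sa)
    P_D = np.zeros((n_sa, n_states))
    for s, a, r, s_next in dataset:
        sa = s * n_actions + a
        counts[sa] += 1
        r_D[sa] += r
        P_D[sa, s_next] += 1
    seen = counts > 0
    r_D[seen] /= counts[seen]
    P_D[seen] /= counts[seen][:, None]
    # Unvisited pairs keep r = 0 and P = 0 here (an assumption of this
    # sketch); that uncertainty is what the paper's pessimism addresses.

    # A_pi averages state-action quantities over the policy:
    # A_pi[s, (s, a)] = pi(a | s).
    A_pi = np.zeros((n_states, n_sa))
    for s in range(n_states):
        A_pi[s, s * n_actions:(s + 1) * n_actions] = pi[s]

    # v = (I - gamma * A_pi P_D)^{-1} A_pi r_D
    return np.linalg.solve(np.eye(n_states) - gamma * A_pi @ P_D,
                           A_pi @ r_D)
```

The unvisited state-action pairs are exactly where this naive empirical model is unreliable, which is the source of the error that the paper's pessimistic variants penalize.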
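The MinAtar suite exposes a small Python API, so a fixed dataset can be generated directly. The sketch below uses a uniform-random behavior policy purely for illustration; the paper's actual data-collection protocol differs and is described in its experimental section.

```python
import numpy as np
from minatar import Environment  # pip install minatar

# Collect a fixed dataset of transitions from one MinAtar game using a
# uniform-random behavior policy (illustrative only, not the paper's
# collection protocol).
env = Environment("breakout")
env.reset()
dataset = []
state = env.state()  # bool array of shape (10, 10, n_channels)
for _ in range(10_000):
    action = np.random.randint(env.num_actions())
    reward, terminal = env.act(action)
    dataset.append((state, action, reward, env.state(), terminal))
    state = env.state()
    if terminal:
        env.reset()
        state = env.state()
```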
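Finally, the quoted experiment-setup details can be summarized in a small configuration dictionary. The key names here are hypothetical; only the values come from the paper's text.

```python
# Hypothetical configuration collecting the quoted settings; key names are
# our own, values are taken from the paper's text.
config = {
    "alpha": 1.0,  # pessimism hyperparameter, shared by both pessimistic algorithms
    "gradient_steps_per_target_update": 250_000,  # approximate, to avoid divergence
    "tuning_setups": [  # hyperparameters were tuned only on these two setups
        ("breakout", {"epsilon": 0.0}),
        ("breakout", {"epsilon": 1.0}),
    ],
}
```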