The Importance of Pessimism in Fixed-Dataset Policy Optimization
Authors: Jacob Buckman, Carles Gelada, Marc G. Bellemare
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | These theoretical findings are validated by experiments on a tabular gridworld, and deep learning experiments on four MinAtar environments. |
| Researcher Affiliation | Not specified | Anonymous authors. Paper under double-blind review. |
| Pseudocode | Yes | Appendix D, Algorithms. Algorithm 1: Tabular Fixed-Dataset Policy Evaluation. Input: dataset $D$, policy $\pi$, discount $\gamma$. Construct $r_D, P_D$ as described in Section 2; $v \leftarrow (I - \gamma A^\pi P_D)^{-1} A^\pi r_D$; return $v$. (A minimal code sketch follows the table.) |
| Open Source Code | Yes | For an open-source implementation, including full details suitable for replication, please refer to the code in the accompanying GitHub repository: github.com/anonymized |
| Open Datasets | Yes | The second setting we evaluate on consists of four environments from the MinAtar suite (Young & Tian, 2019). |
| Dataset Splits | No | The paper mentions dataset sizes and how data is collected, but it does not specify explicit train/validation/test splits for the datasets. |
| Hardware Specification | No | The paper does not specify the hardware used to run its experiments. |
| Software Dependencies | No | The paper does not list the software frameworks or dependency versions needed to reproduce the experiments. |
| Experiment Setup | Yes | For both pessimistic algorithms, we absorb all constants into the hyperparameter α, which we selected to be α = 1 for both algorithms by a simple manual search. All experiments used identical hyperparameters. Hyperparameter tuning was done on just two experimental setups: BREAKOUT using ϵ = 0, and BREAKOUT using ϵ = 1. Tuning was very minimal, and done via a small manual search. In our experiments, approximately 250,000 gradient steps per target update were required to consistently minimize error enough to avoid divergence. |
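
Algorithm 1, quoted in the Pseudocode row above, is a single closed-form matrix computation. The snippet below is a minimal sketch of that tabular fixed-dataset policy evaluation step, assuming a dataset of `(s, a, r, s')` transitions over discrete state and action spaces. The function name, dataset format, and the handling of unvisited state-action pairs are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of tabular fixed-dataset policy evaluation (Algorithm 1):
# v = (I - gamma * A_pi @ P_D)^{-1} @ A_pi @ r_D, built from empirical estimates.
import numpy as np

def evaluate_policy(dataset, pi, gamma, n_states, n_actions):
    """Estimate state values for policy pi from a fixed dataset.

    dataset: iterable of (s, a, r, s_next) tuples with integer s, a, s_next.
    pi:      array of shape (n_states, n_actions), pi[s, a] = pi(a | s).
    """
    n_sa = n_states * n_actions
    counts = np.zeros(n_sa)                    # visit counts per (s, a)
    reward_sum = np.zeros(n_sa)                # summed rewards per (s, a)
    next_counts = np.zeros((n_sa, n_states))   # transition counts to s'

    for s, a, r, s_next in dataset:
        idx = s * n_actions + a
        counts[idx] += 1
        reward_sum[idx] += r
        next_counts[idx, s_next] += 1

    # Empirical reward vector r_D and transition matrix P_D.
    # Unvisited pairs get zero reward and a uniform next-state row here;
    # the paper's treatment of unvisited pairs may differ (assumption).
    safe = np.maximum(counts, 1)
    r_D = reward_sum / safe
    P_D = next_counts / safe[:, None]
    P_D[counts == 0] = 1.0 / n_states

    # A_pi maps state-action values to state values by averaging over pi(a | s).
    A_pi = np.zeros((n_states, n_sa))
    for s in range(n_states):
        A_pi[s, s * n_actions:(s + 1) * n_actions] = pi[s]

    # Solve (I - gamma * A_pi P_D) v = A_pi r_D for the value vector v.
    v = np.linalg.solve(np.eye(n_states) - gamma * A_pi @ P_D, A_pi @ r_D)
    return v
```

For any discount $\gamma < 1$, $A^\pi P_D$ is a stochastic matrix, so $I - \gamma A^\pi P_D$ is invertible and the linear solve is well defined.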