Average-Reward Off-Policy Policy Evaluation with Function Approximation

Authors: Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks."
Researcher Affiliation | Academia | "University of Oxford; University of Alberta."
Pseudocode | No | The paper describes the algorithmic updates for Diff-SGQ, Diff-GQ1, and Diff-GQ2 using numbered equations (3, 4, 11, 15, 16) in the main text, but these are not presented in a structured block explicitly labeled 'Pseudocode' or 'Algorithm'. (A hedged sketch of an update of this general form is given after the table.)
Open Source Code | Yes | "All the implementations are publicly available." https://github.com/ShangtongZhang/DeepRL
Open Datasets | Yes | "We benchmark Diff-SGQ, Diff-GQ1, Diff-GQ2, and GradientDICE in a variant of Boyan's chain (Boyan, 1999)..." (The base domain is sketched after the table.)
Dataset Splits | No | The paper mentions 'grid search with 30 independent runs for hyperparameter tuning' but does not specify explicit training, validation, or test splits, nor any cross-validation procedure.
Hardware Specification | No | The acknowledgments state 'The experiments were made possible by a generous equipment grant from NVIDIA,' but no specific GPU model or other hardware specifications are provided.
Software Dependencies | No | The paper mentions software such as MuJoCo for environments and algorithms such as TD3, but it does not specify version numbers for any programming languages, libraries, or other software dependencies.
Experiment Setup | Yes | "We use constant learning rates α for all compared algorithms, which is tuned in {2^-20, 2^-19, ..., 2^-1}. For Diff-GQ1 and Diff-GQ2, besides tuning α in the same way as Diff-SGQ, we tune η in {0, 0.01, 0.1}. For GradientDICE, besides tuning (α, η) in the same way as Diff-GQ1, we tune λ, the weight for a normalizing term, in {0, 0.1, 1, 10}." (The full grid is enumerated in code after the table.)
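Since the updates exist only as numbered equations in the paper, the following is a minimal sketch of what a linear differential semi-gradient Q-evaluation step of the Diff-SGQ kind could look like. The function name, the feature representation, and the expected-next-value form under the target policy are assumptions in the spirit of standard average-reward TD methods, not a transcription of the paper's Equations (3) and (4).

```python
import numpy as np

# Hedged sketch of a linear differential semi-gradient Q-evaluation step.
# The names, shapes, and the expected-next-value form are assumptions based
# on standard average-reward TD; see the paper's Eqs. (3)-(4) for the
# authors' exact Diff-SGQ update.

def diff_sgq_step(w, r_bar, phi_sa, reward, phi_next_expected, alpha, eta):
    """One update of the weight vector w and the reward-rate estimate r_bar.

    phi_sa:            feature vector of the current state-action pair
    phi_next_expected: sum_a pi(a|S') * phi(S', a), the expectation of
                       next-state-action features under the target policy
    """
    # Differential TD error: reward minus the reward-rate estimate, plus the
    # change in the estimated differential action value.
    delta = reward - r_bar + phi_next_expected @ w - phi_sa @ w
    w = w + alpha * delta * phi_sa          # semi-gradient value update
    r_bar = r_bar + eta * alpha * delta     # reward-rate tracking update
    return w, r_bar
```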
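The benchmark named in the Open Datasets row is a variant of Boyan's chain; the summary does not say how the variant differs, so the sketch below implements only the classic episodic chain from Boyan (1999) as a reference point.

```python
import numpy as np

# Sketch of the classic Boyan (1999) chain; the paper evaluates on a
# *variant* whose exact modifications are not given in this summary,
# so treat this as the base domain only.
N = 12  # states are 12, 11, ..., 1, 0; state 0 is absorbing

def step(state, rng):
    """One transition of the classic chain."""
    if state >= 2:
        # Move one or two states toward 0 with equal probability, reward -3.
        return state - rng.integers(1, 3), -3.0
    if state == 1:
        return 0, -2.0   # deterministic final step, reward -2
    return 0, 0.0        # absorbing state

rng = np.random.default_rng(0)
s, ret = N, 0.0
while s != 0:
    s, r = step(s, rng)
    ret += r
```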
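Finally, the Experiment Setup quote fully determines the hyperparameter grid, which can be enumerated as follows. The value sets and the 30 runs per configuration come from the quoted text; the loop structure and the `run_trial` hook are hypothetical placeholders.

```python
from itertools import product

# Hyperparameter grids quoted from the paper's experiment setup.
alphas = [2.0 ** -k for k in range(20, 0, -1)]   # 2^-20, 2^-19, ..., 2^-1
etas = [0, 0.01, 0.1]                            # Diff-GQ1 / Diff-GQ2 / GradientDICE
lambdas = [0, 0.1, 1, 10]                        # GradientDICE only

grids = {
    "Diff-SGQ": product(alphas),
    "Diff-GQ1": product(alphas, etas),
    "Diff-GQ2": product(alphas, etas),
    "GradientDICE": product(alphas, etas, lambdas),
}

for algo, grid in grids.items():
    for config in grid:
        for seed in range(30):   # "30 independent runs" per configuration
            pass                 # run_trial(algo, config, seed)  # hypothetical hook
```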