Average-Reward Off-Policy Policy Evaluation with Function Approximation

Authors: Shangtong Zhang, Yi Wan, Richard S. Sutton, Shimon Whiteson

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks."
Researcher Affiliation | Academia | "University of Oxford; University of Alberta."
Pseudocode | No | The paper describes the algorithmic updates for Diff-SGQ, Diff-GQ1, and Diff-GQ2 using numbered equations (3, 4, 11, 15, 16) in the main text, but these are not presented in a structured block explicitly labeled 'Pseudocode' or 'Algorithm'. (A hedged sketch of an update of this general form is given after the table.)
Open Source Code | Yes | "All the implementations are publicly available." https://github.com/ShangtongZhang/DeepRL
Open Datasets | Yes | "We benchmark Diff-SGQ, Diff-GQ1, Diff-GQ2, and GradientDICE in a variant of Boyan's chain (Boyan, 1999)..." (The base domain is sketched after the table.)
Dataset Splits | No | The paper mentions 'grid search with 30 independent runs for hyperparameter tuning' but does not specify explicit training, validation, or test splits, nor any cross-validation procedure.
Hardware Specification | No | The acknowledgments state 'The experiments were made possible by a generous equipment grant from NVIDIA,' but no specific GPU model or other hardware specifications are provided.
Software Dependencies | No | The paper mentions software such as MuJoCo for environments and algorithms such as TD3, but it does not specify version numbers for any programming languages, libraries, or other software dependencies.
Experiment Setup | Yes | "We use constant learning rates α for all compared algorithms, which is tuned in {2^-20, 2^-19, ..., 2^-1}. For Diff-GQ1 and Diff-GQ2, besides tuning α in the same way as Diff-SGQ, we tune η in {0, 0.01, 0.1}. For GradientDICE, besides tuning (α, η) in the same way as Diff-GQ1, we tune λ, the weight for a normalizing term, in {0, 0.1, 1, 10}." (The full grid is enumerated in code after the table.)
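Since the updates exist only as numbered equations in the paper, the following is a minimal sketch of what a linear differential semi-gradient Q-evaluation step of the Diff-SGQ kind could look like. The function name, the feature representation, and the expected-next-value form under the target policy are assumptions in the spirit of standard average-reward TD methods, not a transcription of the paper's Equations (3) and (4).

```python
import numpy as np

# Hedged sketch of a linear differential semi-gradient Q-evaluation step.
# The names, shapes, and the expected-next-value form are assumptions based
# on standard average-reward TD; see the paper's Eqs. (3)-(4) for the
# authors' exact Diff-SGQ update.

def diff_sgq_step(w, r_bar, phi_sa, reward, phi_next_expected, alpha, eta):
    """One update of the weight vector w and the reward-rate estimate r_bar.

    phi_sa:            feature vector of the current state-action pair
    phi_next_expected: sum_a pi(a|S') * phi(S', a), the expectation of
                       next-state-action features under the target policy
    """
    # Differential TD error: reward minus the reward-rate estimate, plus the
    # change in the estimated differential action value.
    delta = reward - r_bar + phi_next_expected @ w - phi_sa @ w
    w = w + alpha * delta * phi_sa          # semi-gradient value update
    r_bar = r_bar + eta * alpha * delta     # reward-rate tracking update
    return w, r_bar
```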
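The benchmark named in the Open Datasets row is a variant of Boyan's chain; the summary does not say how the variant differs, so the sketch below implements only the classic episodic chain from Boyan (1999) as a reference point.

```python
import numpy as np

# Sketch of the classic Boyan (1999) chain; the paper evaluates on a
# *variant* whose exact modifications are not given in this summary,
# so treat this as the base domain only.
N = 12  # states are 12, 11, ..., 1, 0; state 0 is absorbing

def step(state, rng):
    """One transition of the classic chain."""
    if state >= 2:
        # Move one or two states toward 0 with equal probability, reward -3.
        return state - rng.integers(1, 3), -3.0
    if state == 1:
        return 0, -2.0   # deterministic final step, reward -2
    return 0, 0.0        # absorbing state

rng = np.random.default_rng(0)
s, ret = N, 0.0
while s != 0:
    s, r = step(s, rng)
    ret += r
```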
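Finally, the Experiment Setup quote fully determines the hyperparameter grid, which can be enumerated as follows. The value sets and the 30 runs per configuration come from the quoted text; the loop structure and the `run_trial` hook are hypothetical placeholders.

```python
from itertools import product

# Hyperparameter grids quoted from the paper's experiment setup.
alphas = [2.0 ** -k for k in range(20, 0, -1)]   # 2^-20, 2^-19, ..., 2^-1
etas = [0, 0.01, 0.1]                            # Diff-GQ1 / Diff-GQ2 / GradientDICE
lambdas = [0, 0.1, 1, 10]                        # GradientDICE only

grids = {
    "Diff-SGQ": product(alphas),
    "Diff-GQ1": product(alphas, etas),
    "Diff-GQ2": product(alphas, etas),
    "GradientDICE": product(alphas, etas, lambdas),
}

for algo, grid in grids.items():
    for config in grid:
        for seed in range(30):   # "30 independent runs" per configuration
            pass                 # run_trial(algo, config, seed)  # hypothetical hook
```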