Average-Reward Off-Policy Policy Evaluation with Function Approximation
Authors: Shangtong Zhang, Yi Wan, Richard S Sutton, Shimon Whiteson
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks. |
| Researcher Affiliation | Academia | ¹University of Oxford, ²University of Alberta. |
| Pseudocode | No | The paper describes the algorithmic updates for Diff-SGQ, Diff-GQ1, and Diff-GQ2 using numbered equations (3, 4, 11, 15, 16) within the main text, but these are not presented in a structured block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | All the implementations are publicly available: https://github.com/ShangtongZhang/DeepRL |
| Open Datasets | Yes | We benchmark Diff-SGQ, Diff-GQ1, Diff-GQ2, and GradientDICE in a variant of Boyan's chain (Boyan, 1999)... |
| Dataset Splits | No | The paper mentions 'grid search with 30 independent runs for hyperparameter tuning' but does not specify explicit training, validation, or test dataset splits or cross-validation methods. |
| Hardware Specification | No | The acknowledgments state 'The experiments were made possible by a generous equipment grant from NVIDIA,' but no specific GPU model or other hardware specifications are provided. |
| Software Dependencies | No | The paper mentions software like MuJoCo for environments and algorithms like TD3, but it does not specify version numbers for any programming languages, libraries, or other software dependencies. |
| Experiment Setup | Yes | We use constant learning rates α for all compared algorithms, tuned in {2^-20, 2^-19, ..., 2^-1}. For Diff-GQ1 and Diff-GQ2, besides tuning α in the same way as Diff-SGQ, we tune η in {0, 0.01, 0.1}. For GradientDICE, besides tuning (α, η) in the same way as Diff-GQ1, we tune λ, the weight for a normalizing term, in {0, 0.1, 1, 10}. (The reported grids are sketched below.) |
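The sketch below restates the hyperparameter grids quoted in the "Experiment Setup" row as plain Python. The function and variable names (`configs`, `alpha_grid`, etc.) are illustrative assumptions and do not come from the authors' released code; only the grid values themselves are taken from the paper.

```python
# Minimal sketch of the quoted hyperparameter grids; names are hypothetical, values are from the paper.
import itertools

# Learning rate alpha shared by all compared algorithms: {2^-20, 2^-19, ..., 2^-1}.
alpha_grid = [2 ** -i for i in range(20, 0, -1)]

# Second step size eta, tuned for Diff-GQ1, Diff-GQ2, and GradientDICE.
eta_grid = [0, 0.01, 0.1]

# Weight lambda of the normalizing term, tuned for GradientDICE only.
lambda_grid = [0, 0.1, 1, 10]


def configs(algorithm):
    """Yield one hyperparameter dictionary per grid-search configuration."""
    if algorithm == "Diff-SGQ":
        for alpha in alpha_grid:
            yield {"alpha": alpha}
    elif algorithm in ("Diff-GQ1", "Diff-GQ2"):
        for alpha, eta in itertools.product(alpha_grid, eta_grid):
            yield {"alpha": alpha, "eta": eta}
    elif algorithm == "GradientDICE":
        for alpha, eta, lam in itertools.product(alpha_grid, eta_grid, lambda_grid):
            yield {"alpha": alpha, "eta": eta, "lambda": lam}


# Example: size of each algorithm's search space (the paper reports 30 independent runs per configuration).
print(len(list(configs("Diff-SGQ"))))       # 20
print(len(list(configs("Diff-GQ1"))))       # 20 * 3 = 60
print(len(list(configs("GradientDICE"))))   # 20 * 3 * 4 = 240
```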