Generalised Policy Improvement with Geometric Policy Composition
Authors: Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Remi Munos, Andre Barreto
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To understand how GSP [geometric switching policy] evaluation using GHMs [geometric horizon models] and GGPI [geometric generalised policy improvement] perform at scale, we test them on a deep RL transfer task. Full details and further results are given in Appendix E. |
| Researcher Affiliation | Industry | DeepMind, London. Correspondence to: Shantanu Thakoor <thakoor@deepmind.com>, Mark Rowland <markrowland@deepmind.com>. |
| Pseudocode | Yes | Algorithm 1: GGPI for sample-based policy iteration; Algorithm 2: GGPI for sample-based transfer. (A hedged sketch of the underlying GPI action-selection step follows the table.) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | No | The paper describes a custom environment setup ('sparse-reward ant') inspired by prior work but does not provide concrete access (link, DOI, specific citation to a downloadable dataset) to a publicly available, pre-existing dataset used for training. |
| Dataset Splits | No | The paper mentions 'training', 'validation', and 'test' phases, but it does not provide specific percentages or counts for dataset splits (e.g., '80/10/10 split') for reproducibility. |
| Hardware Specification | No | The paper mentions software used (Python, NumPy, Matplotlib, Jax, MuJoCo) and a 'distributed actor-learner setup' but provides no specific details about the hardware components (e.g., GPU models, CPU types, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions several software components such as Python, NumPy, Matplotlib, JAX, and the DeepMind JAX Ecosystem, along with publication years in citations, but does not state specific version numbers for these dependencies in the text. |
| Experiment Setup | Yes | The policies are implemented as a 4-layer MLP with 256 hidden units including layer normalisation (Ba et al., 2016) and tanh non-linearities. ... The policies are pretrained for 1 million update steps, using the Adam optimiser (Kingma & Ba, 2015) with a learning rate of 0.0003. |
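
The paper's Algorithms 1 and 2 are not reproduced in this report. As a hedged illustration of the generalised policy improvement (GPI) step that GGPI builds on, the sketch below selects actions greedily over value estimates from a set of candidate policies. The function name `gpi_action`, the array shapes, and the way the Q-estimates are obtained are assumptions for illustration only; in GGPI these estimates would come from GHM-based evaluation of geometric switching policies, which is not shown here.

```python
import jax.numpy as jnp


def gpi_action(q_estimates: jnp.ndarray) -> int:
    """Greedy GPI action selection (illustrative sketch, not the paper's code).

    Args:
      q_estimates: array of shape [num_policies, num_actions], where
        q_estimates[i, a] estimates Q^{pi_i}(s, a) at the current state s.
        In GGPI these values would be produced by GHM-based evaluation of
        geometric switching policies; here they are assumed to be given.

    Returns:
      The action maximising max_i Q^{pi_i}(s, a), i.e. the GPI choice.
    """
    # Best value over candidate policies for each action...
    best_over_policies = jnp.max(q_estimates, axis=0)
    # ...then act greedily with respect to that upper envelope.
    return int(jnp.argmax(best_over_policies))


# Hypothetical usage: three candidate policies, four actions.
q = jnp.array([[0.1, 0.4, 0.2, 0.0],
               [0.3, 0.1, 0.5, 0.2],
               [0.0, 0.2, 0.1, 0.6]])
print(gpi_action(q))  # -> 3: the third policy promises the highest value there
```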
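
For the experiment-setup row above, the following is a minimal configuration sketch of the described network and optimiser, assuming plain JAX with Optax (the paper cites the DeepMind JAX ecosystem but releases no code). It wires up a 4-layer MLP with 256 hidden units, layer normalisation, tanh non-linearities, and Adam with learning rate 3e-4. The interpretation of "4-layer" as four hidden layers, the ordering of layer norm before tanh, the initialisation, and the placeholder loss are all assumptions.

```python
import jax
import jax.numpy as jnp
import optax

HIDDEN = 256  # hidden units per layer, as stated in the quoted setup
LAYERS = 4    # assumed to mean four hidden layers; the paper does not disambiguate


def init_params(key, obs_dim, out_dim):
    """Initialise MLP weights plus per-layer layer-norm scale/offset (placeholder init)."""
    sizes = [obs_dim] + [HIDDEN] * LAYERS + [out_dim]
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append({
            "w": jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
            "b": jnp.zeros(d_out),
            "ln_scale": jnp.ones(d_out),
            "ln_offset": jnp.zeros(d_out),
        })
    return params


def layer_norm(x, scale, offset, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps) * scale + offset


def policy_net(params, obs):
    """MLP with layer norm then tanh on hidden layers (ordering assumed), linear output."""
    x = obs
    for layer in params[:-1]:
        x = x @ layer["w"] + layer["b"]
        x = jnp.tanh(layer_norm(x, layer["ln_scale"], layer["ln_offset"]))
    return x @ params[-1]["w"] + params[-1]["b"]


# Adam with learning rate 3e-4, as in the quoted setup.
optimizer = optax.adam(3e-4)


def update(params, opt_state, batch):
    """One gradient step on a placeholder regression loss (not the paper's objective)."""
    def loss_fn(p):
        pred = policy_net(p, batch["obs"])
        return jnp.mean((pred - batch["target"]) ** 2)

    grads = jax.grad(loss_fn)(params)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state
```

A training loop would initialise `opt_state = optimizer.init(params)` and call `update` for the stated 1 million pretraining steps; the data pipeline and the actual policy-learning loss are left unspecified, as in the quoted text.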