Generalised Policy Improvement with Geometric Policy Composition
Authors: Shantanu Thakoor, Mark Rowland, Diana Borsa, Will Dabney, Remi Munos, Andre Barreto
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To understand how GSP [geometric switching policy] evaluation using GHMs [geometric horizon models] and GGPI [geometric generalised policy improvement] perform at scale, we test them on a deep RL transfer task. Full details and further results are given in Appendix E. |
| Researcher Affiliation | Industry | DeepMind, London. Correspondence to: Shantanu Thakoor <thakoor@deepmind.com>, Mark Rowland <markrowland@deepmind.com>. |
| Pseudocode | Yes | Algorithm 1: GGPI for sample-based policy iteration; Algorithm 2: GGPI for sample-based transfer. (A hedged sketch of the underlying GPI action-selection step follows the table.) |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | No | The paper describes a custom environment setup ('sparse-reward ant') inspired by prior work but does not provide concrete access (link, DOI, specific citation to a downloadable dataset) to a publicly available, pre-existing dataset used for training. |
| Dataset Splits | No | The paper mentions 'training', 'validation', and 'test' phases, but it does not provide specific percentages or counts for dataset splits (e.g., '80/10/10 split') for reproducibility. |
| Hardware Specification | No | The paper mentions software used (Python, NumPy, Matplotlib, Jax, MuJoCo) and a 'distributed actor-learner setup' but provides no specific details about the hardware components (e.g., GPU models, CPU types, memory) used for the experiments. |
| Software Dependencies | No | The paper mentions several software components such as Python, NumPy, Matplotlib, JAX, and the DeepMind JAX Ecosystem, along with publication years in citations, but does not state specific version numbers for these dependencies in the text. |
| Experiment Setup | Yes | The policies are implemented as a 4-layer MLP with 256 hidden units including layer normalisation (Ba et al., 2016) and tanh non-linearities. ... The policies are pretrained for 1 million update steps, using the Adam optimiser (Kingma & Ba, 2015) with a learning rate of 0.0003. |
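
The paper's Algorithms 1 and 2 are not reproduced in this report. As a hedged illustration of the generalised policy improvement (GPI) step that GGPI builds on, the sketch below selects actions greedily over value estimates from a set of candidate policies. The function name `gpi_action`, the array shapes, and the way the Q-estimates are obtained are assumptions for illustration only; in GGPI these estimates would come from GHM-based evaluation of geometric switching policies, which is not shown here.

```python
import jax.numpy as jnp


def gpi_action(q_estimates: jnp.ndarray) -> int:
    """Greedy GPI action selection (illustrative sketch, not the paper's code).

    Args:
      q_estimates: array of shape [num_policies, num_actions], where
        q_estimates[i, a] estimates Q^{pi_i}(s, a) at the current state s.
        In GGPI these values would be produced by GHM-based evaluation of
        geometric switching policies; here they are assumed to be given.

    Returns:
      The action maximising max_i Q^{pi_i}(s, a), i.e. the GPI choice.
    """
    # Best value over candidate policies for each action...
    best_over_policies = jnp.max(q_estimates, axis=0)
    # ...then act greedily with respect to that upper envelope.
    return int(jnp.argmax(best_over_policies))


# Hypothetical usage: three candidate policies, four actions.
q = jnp.array([[0.1, 0.4, 0.2, 0.0],
               [0.3, 0.1, 0.5, 0.2],
               [0.0, 0.2, 0.1, 0.6]])
print(gpi_action(q))  # -> 3: the third policy promises the highest value there
```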
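
For the experiment-setup row above, the following is a minimal configuration sketch of the described network and optimiser, assuming plain JAX with Optax (the paper cites the DeepMind JAX ecosystem but releases no code). It wires up a 4-layer MLP with 256 hidden units, layer normalisation, tanh non-linearities, and Adam with learning rate 3e-4. The interpretation of "4-layer" as four hidden layers, the ordering of layer norm before tanh, the initialisation, and the placeholder loss are all assumptions.

```python
import jax
import jax.numpy as jnp
import optax

HIDDEN = 256  # hidden units per layer, as stated in the quoted setup
LAYERS = 4    # assumed to mean four hidden layers; the paper does not disambiguate


def init_params(key, obs_dim, out_dim):
    """Initialise MLP weights plus per-layer layer-norm scale/offset (placeholder init)."""
    sizes = [obs_dim] + [HIDDEN] * LAYERS + [out_dim]
    params = []
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        key, sub = jax.random.split(key)
        params.append({
            "w": jax.random.normal(sub, (d_in, d_out)) / jnp.sqrt(d_in),
            "b": jnp.zeros(d_out),
            "ln_scale": jnp.ones(d_out),
            "ln_offset": jnp.zeros(d_out),
        })
    return params


def layer_norm(x, scale, offset, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps) * scale + offset


def policy_net(params, obs):
    """MLP with layer norm then tanh on hidden layers (ordering assumed), linear output."""
    x = obs
    for layer in params[:-1]:
        x = x @ layer["w"] + layer["b"]
        x = jnp.tanh(layer_norm(x, layer["ln_scale"], layer["ln_offset"]))
    return x @ params[-1]["w"] + params[-1]["b"]


# Adam with learning rate 3e-4, as in the quoted setup.
optimizer = optax.adam(3e-4)


def update(params, opt_state, batch):
    """One gradient step on a placeholder regression loss (not the paper's objective)."""
    def loss_fn(p):
        pred = policy_net(p, batch["obs"])
        return jnp.mean((pred - batch["target"]) ** 2)

    grads = jax.grad(loss_fn)(params)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state
```

A training loop would initialise `opt_state = optimizer.init(params)` and call `update` for the stated 1 million pretraining steps; the data pipeline and the actual policy-learning loss are left unspecified, as in the quoted text.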