Policy improvement by planning with Gumbel
Authors: Ivo Danihelka, Arthur Guez, Julian Schrittwieser, David Silver
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted the experiments on Go, chess, and Atari. We present the main results here and we report additional ablations and experimental details in Appendix F. On Go, we use Elo to compare MuZero and other agents. While an agent trains by self-play, its Elo is computed by evaluation versus reference opponents. |
| Researcher Affiliation | Collaboration | Ivo Danihelka¹ ², Arthur Guez¹, Julian Schrittwieser¹, David Silver¹ ²; ¹DeepMind, London, UK; ²University College London |
| Pseudocode | Yes | Algorithm 1 Policy Improvement by Planning with Gumbel; Algorithm 2 Sequential Halving with Gumbel |
| Open Source Code | Yes | and the released open-source code.³ (Footnote 3: https://github.com/deepmind/mctx) |
| Open Datasets | Yes | Our new algorithms, Gumbel AlphaZero and Gumbel MuZero...match the state of the art on Go, chess, and Atari...We conducted the experiments on Go, chess, and Atari. We use the Arcade Learning Environment (Bellemare et al., 2013) |
| Dataset Splits | No | The paper describes training and evaluation using self-play and comparisons but does not provide explicit training, validation, or test dataset splits in terms of percentages, sample counts, or defined methodologies for data partitioning. |
| Hardware Specification | Yes | To run the experiments, we used Google Cloud Tensor Processing Units v3 (TPUs). |
| Software Dependencies | No | The paper mentions “JAX” and “Haiku” (in Figure 11) but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | In the five plots, the n varies from 2 to 200. In all Go and chess experiments, Gumbel MuZero scales the Q-values by c_visit = 50 and c_scale = 1.0. For Gumbel MuZero, we use the same normalization and we scale the normalized Q-values by c_visit = 50 and c_scale = 0.1. |
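
The c_visit and c_scale settings quoted in the Experiment Setup row control how strongly Q-values are weighted against the policy logits. The following is a minimal, illustrative sketch of that scaling, assuming the transform has the form (c_visit + max visit count) · c_scale · q and that the improved policy is a softmax over logits plus the scaled Q-values. The function names `sigma` and `improved_policy`, and the sample numbers, are hypothetical and are not taken from the released code; the sketch also assumes the Q-values passed in are already completed and normalized.

```python
import math


def sigma(q_values, visit_counts, c_visit=50.0, c_scale=1.0):
    """Scale Q-values by (c_visit + max_b N(b)) * c_scale (assumed form)."""
    max_visits = max(visit_counts) if visit_counts else 0
    factor = (c_visit + max_visits) * c_scale
    return [factor * q for q in q_values]


def improved_policy(logits, q_values, visit_counts, c_visit=50.0, c_scale=1.0):
    """Return softmax(logits + sigma(q)), computed in a numerically stable way."""
    scaled = sigma(q_values, visit_counts, c_visit, c_scale)
    scores = [l + s for l, s in zip(logits, scaled)]
    top = max(scores)  # subtract the max before exponentiating
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Hypothetical inputs: uniform logits, Q-values assumed to lie in [0, 1].
    logits = [0.0, 0.0, 0.0]
    q_values = [0.1, 0.5, 0.3]
    visit_counts = [2, 5, 1]
    # Settings matching the second configuration quoted in the table.
    print(sigma(q_values, visit_counts, c_visit=50, c_scale=0.1))
    print(improved_policy(logits, q_values, visit_counts, c_visit=50, c_scale=0.1))
```

With a larger c_scale the scaled Q-values dominate the logits, so the resulting policy concentrates more sharply on the action with the highest Q-value; a smaller c_scale keeps the policy closer to the prior.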