Policy improvement by planning with Gumbel

Authors: Ivo Danihelka, Arthur Guez, Julian Schrittwieser, David Silver

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conducted the experiments on Go, chess, and Atari. We present the main results here and we report additional ablations and experimental details in Appendix F. On Go, we use Elo to compare MuZero and other agents. While an agent trains by self-play, its Elo is computed by evaluation versus reference opponents."
Researcher Affiliation | Collaboration | Ivo Danihelka (1,2), Arthur Guez (1), Julian Schrittwieser (1), David Silver (1,2); affiliations: 1 DeepMind, London, UK; 2 University College London
Pseudocode | Yes | Algorithm 1: Policy Improvement by Planning with Gumbel; Algorithm 2: Sequential Halving with Gumbel
Open Source Code | Yes | "... and the released open-source code." https://github.com/deepmind/mctx
Open Datasets | Yes | "Our new algorithms, Gumbel AlphaZero and Gumbel MuZero ... match the state of the art on Go, chess, and Atari ... We conducted the experiments on Go, chess, and Atari. We use the Arcade Learning Environment (Bellemare et al., 2013)."
Dataset Splits | No | The paper describes training and evaluation using self-play and comparisons, but does not provide explicit training, validation, or test dataset splits in terms of percentages, sample counts, or defined methodologies for data partitioning.
Hardware Specification | Yes | "To run the experiments, we used Google Cloud Tensor Processing Units v3 (TPUs)."
Software Dependencies | No | The paper mentions "JAX" and "Haiku" (in Figure 11) but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "In the five plots, the n varies from 2 to 200. In all Go and chess experiments, Gumbel MuZero scales the Q-values by c_visit = 50 and c_scale = 1.0. For Gumbel MuZero, we use the same normalization and we scale the normalized Q-values by c_visit = 50 and c_scale = 0.1."
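To make the Pseudocode and Experiment Setup rows above concrete, here is a minimal NumPy sketch of Sequential Halving with Gumbel (Algorithm 2 in the paper) together with the Q-value transform sigma(q) = (c_visit + max_b N(b)) * c_scale * q that the reported c_visit and c_scale settings refer to. The simulate_once callback, the default parameter values, and the exact per-phase budget split are illustrative assumptions, not the paper's reference implementation; the released mctx code is the authoritative version.

```python
import math
import numpy as np

def sigma(q_hat, max_visits, c_visit=50.0, c_scale=1.0):
    # Monotone Q-value transform from the paper:
    # sigma(q) = (c_visit + max_b N(b)) * c_scale * q, with q already normalized.
    return (c_visit + max_visits) * c_scale * q_hat

def sequential_halving_with_gumbel(logits, simulate_once, budget, m, rng):
    """Pick a root action from m Gumbel-sampled candidates using `budget` simulations.

    logits: prior logits over all actions (1-D array).
    simulate_once(a): hypothetical callback; runs one simulation for action a and
        returns an updated, [0, 1]-normalized value estimate q_hat(a).
    budget: total number of simulations n.
    m: number of candidate actions sampled without replacement at the root.
    """
    gumbel = rng.gumbel(size=len(logits))                    # g(a) ~ Gumbel(0, 1)
    candidates = list(np.argsort(-(gumbel + logits))[:m])    # top-m of g(a) + logits(a)
    q_hat = np.zeros(len(logits))
    visits = np.zeros(len(logits), dtype=int)
    num_phases = max(1, math.ceil(math.log2(m)))

    while len(candidates) > 1:
        # Spread this phase's share of the budget evenly over the remaining candidates.
        sims_per_action = max(1, budget // (num_phases * len(candidates)))
        for a in candidates:
            for _ in range(sims_per_action):
                q_hat[a] = simulate_once(a)
                visits[a] += 1
        # Keep the better half, ranked by g(a) + logits(a) + sigma(q_hat(a)).
        candidates.sort(
            key=lambda a: gumbel[a] + logits[a] + sigma(q_hat[a], visits.max()),
            reverse=True)
        candidates = candidates[:max(1, len(candidates) // 2)]
    return candidates[0]

# Example usage with a dummy simulator that prefers action 2 (illustrative only).
rng = np.random.default_rng(0)
best = sequential_halving_with_gumbel(
    logits=np.zeros(8),
    simulate_once=lambda a: 1.0 if a == 2 else 0.3,
    budget=32, m=4, rng=rng)
print(best)
```

With the reported settings (c_visit = 50 and c_scale between 0.1 and 1.0), the sigma transform scales the normalized Q-values so that, as visit counts grow, the value estimates can outweigh the prior logits in the ranking.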
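The released code referenced in the Open Source Code row (https://github.com/deepmind/mctx) exposes the paper's search as mctx.gumbel_muzero_policy. Below is a small usage sketch with a dummy model; the toy shapes, the trivial recurrent_fn, and the chosen simulation budget are illustrative assumptions, and keyword names may differ slightly between mctx versions.

```python
import jax
import jax.numpy as jnp
import mctx

batch_size, num_actions = 1, 4
rng_key = jax.random.PRNGKey(0)

# Root of the search: prior logits, value estimate, and a latent "embedding".
root = mctx.RootFnOutput(
    prior_logits=jnp.zeros([batch_size, num_actions]),
    value=jnp.zeros([batch_size]),
    embedding=jnp.zeros([batch_size, 1]),
)

def recurrent_fn(params, rng_key, action, embedding):
    # Dummy learned model: zero reward, discount 1, uniform prior, zero value.
    del params, rng_key, action
    output = mctx.RecurrentFnOutput(
        reward=jnp.zeros([batch_size]),
        discount=jnp.ones([batch_size]),
        prior_logits=jnp.zeros([batch_size, num_actions]),
        value=jnp.zeros([batch_size]),
    )
    return output, embedding

policy_output = mctx.gumbel_muzero_policy(
    params=(),
    rng_key=rng_key,
    root=root,
    recurrent_fn=recurrent_fn,
    num_simulations=32,
    max_num_considered_actions=num_actions,
)
print(policy_output.action)          # selected action per batch element
print(policy_output.action_weights)  # improved policy over actions
```

In this sketch, policy_output.action is the action chosen by Sequential Halving with Gumbel, and policy_output.action_weights is the improved policy that can be used as a training target.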