Policy improvement by planning with Gumbel

Authors: Ivo Danihelka, Arthur Guez, Julian Schrittwieser, David Silver

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conducted the experiments on Go, chess, and Atari. We present the main results here and we report additional ablations and experimental details in Appendix F. On Go, we use Elo to compare MuZero and other agents. While an agent trains by self-play, its Elo is computed by evaluation versus reference opponents."
Researcher Affiliation | Collaboration | Ivo Danihelka (1,2), Arthur Guez (1), Julian Schrittwieser (1), David Silver (1,2); affiliations: 1 DeepMind, London, UK; 2 University College London
Pseudocode | Yes | Algorithm 1: Policy Improvement by Planning with Gumbel; Algorithm 2: Sequential Halving with Gumbel
Open Source Code | Yes | "... and the released open-source code." https://github.com/deepmind/mctx
Open Datasets | Yes | "Our new algorithms, Gumbel AlphaZero and Gumbel MuZero ... match the state of the art on Go, chess, and Atari ... We conducted the experiments on Go, chess, and Atari. We use the Arcade Learning Environment (Bellemare et al., 2013)."
Dataset Splits | No | The paper describes training and evaluation using self-play and comparisons, but does not provide explicit training, validation, or test dataset splits in terms of percentages, sample counts, or defined methodologies for data partitioning.
Hardware Specification | Yes | "To run the experiments, we used Google Cloud Tensor Processing Units v3 (TPUs)."
Software Dependencies | No | The paper mentions "JAX" and "Haiku" (in Figure 11) but does not provide specific version numbers for these software components.
Experiment Setup | Yes | "In the five plots, the n varies from 2 to 200. In all Go and chess experiments, Gumbel MuZero scales the Q-values by c_visit = 50 and c_scale = 1.0. For Gumbel MuZero, we use the same normalization and we scale the normalized Q-values by c_visit = 50 and c_scale = 0.1."
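To make the Pseudocode and Experiment Setup rows above concrete, here is a minimal NumPy sketch of Sequential Halving with Gumbel (Algorithm 2 in the paper) together with the Q-value transform sigma(q) = (c_visit + max_b N(b)) * c_scale * q that the reported c_visit and c_scale settings refer to. The simulate_once callback, the default parameter values, and the exact per-phase budget split are illustrative assumptions, not the paper's reference implementation; the released mctx code is the authoritative version.

```python
import math
import numpy as np

def sigma(q_hat, max_visits, c_visit=50.0, c_scale=1.0):
    # Monotone Q-value transform from the paper:
    # sigma(q) = (c_visit + max_b N(b)) * c_scale * q, with q already normalized.
    return (c_visit + max_visits) * c_scale * q_hat

def sequential_halving_with_gumbel(logits, simulate_once, budget, m, rng):
    """Pick a root action from m Gumbel-sampled candidates using `budget` simulations.

    logits: prior logits over all actions (1-D array).
    simulate_once(a): hypothetical callback; runs one simulation for action a and
        returns an updated, [0, 1]-normalized value estimate q_hat(a).
    budget: total number of simulations n.
    m: number of candidate actions sampled without replacement at the root.
    """
    gumbel = rng.gumbel(size=len(logits))                    # g(a) ~ Gumbel(0, 1)
    candidates = list(np.argsort(-(gumbel + logits))[:m])    # top-m of g(a) + logits(a)
    q_hat = np.zeros(len(logits))
    visits = np.zeros(len(logits), dtype=int)
    num_phases = max(1, math.ceil(math.log2(m)))

    while len(candidates) > 1:
        # Spread this phase's share of the budget evenly over the remaining candidates.
        sims_per_action = max(1, budget // (num_phases * len(candidates)))
        for a in candidates:
            for _ in range(sims_per_action):
                q_hat[a] = simulate_once(a)
                visits[a] += 1
        # Keep the better half, ranked by g(a) + logits(a) + sigma(q_hat(a)).
        candidates.sort(
            key=lambda a: gumbel[a] + logits[a] + sigma(q_hat[a], visits.max()),
            reverse=True)
        candidates = candidates[:max(1, len(candidates) // 2)]
    return candidates[0]

# Example usage with a dummy simulator that prefers action 2 (illustrative only).
rng = np.random.default_rng(0)
best = sequential_halving_with_gumbel(
    logits=np.zeros(8),
    simulate_once=lambda a: 1.0 if a == 2 else 0.3,
    budget=32, m=4, rng=rng)
print(best)
```

With the reported settings (c_visit = 50 and c_scale between 0.1 and 1.0), the sigma transform scales the normalized Q-values so that, as visit counts grow, the value estimates can outweigh the prior logits in the ranking.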
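The released code referenced in the Open Source Code row (https://github.com/deepmind/mctx) exposes the paper's search as mctx.gumbel_muzero_policy. Below is a small usage sketch with a dummy model; the toy shapes, the trivial recurrent_fn, and the chosen simulation budget are illustrative assumptions, and keyword names may differ slightly between mctx versions.

```python
import jax
import jax.numpy as jnp
import mctx

batch_size, num_actions = 1, 4
rng_key = jax.random.PRNGKey(0)

# Root of the search: prior logits, value estimate, and a latent "embedding".
root = mctx.RootFnOutput(
    prior_logits=jnp.zeros([batch_size, num_actions]),
    value=jnp.zeros([batch_size]),
    embedding=jnp.zeros([batch_size, 1]),
)

def recurrent_fn(params, rng_key, action, embedding):
    # Dummy learned model: zero reward, discount 1, uniform prior, zero value.
    del params, rng_key, action
    output = mctx.RecurrentFnOutput(
        reward=jnp.zeros([batch_size]),
        discount=jnp.ones([batch_size]),
        prior_logits=jnp.zeros([batch_size, num_actions]),
        value=jnp.zeros([batch_size]),
    )
    return output, embedding

policy_output = mctx.gumbel_muzero_policy(
    params=(),
    rng_key=rng_key,
    root=root,
    recurrent_fn=recurrent_fn,
    num_simulations=32,
    max_num_considered_actions=num_actions,
)
print(policy_output.action)          # selected action per batch element
print(policy_output.action_weights)  # improved policy over actions
```

In this sketch, policy_output.action is the action chosen by Sequential Halving with Gumbel, and policy_output.action_weights is the improved policy that can be used as a training target.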