Learning and Planning in Complex Action Spaces

Authors: Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, David Silver

ICML 2021

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite." |
| Researcher Affiliation | Industry | "DeepMind, London, UK. Correspondence to: Thomas Hubert <tkhubert@google.com>." |
| Pseudocode | No | The algorithm, Sampled MuZero, is described in prose in Section 5, which details its modifications to MuZero, but there is no formal pseudocode block or algorithm listing. (An illustrative sketch of the sampled action-selection step appears after this table.) |
| Open Source Code | No | The paper does not provide a direct link to, or an explicit statement about, the public availability of its source code. |
| Open Datasets | Yes | "To demonstrate the generality of this approach, we apply our algorithm to two continuous control benchmark domains, the DeepMind Control Suite (Tassa et al., 2018) and Real-World RL Suite (Dulac-Arnold et al., 2020)." |
| Dataset Splits | No | The paper mentions using "3 seeds per experiment" and refers to "data budgets" and "task classification" from other papers, but does not provide specific train/validation/test split information (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | No | The paper does not specify the CPU or GPU models, memory, or any other hardware used to run the experiments. |
| Software Dependencies | No | While it states "All models are implemented in JAX (Bradbury et al., 2018) using Haiku (Hennigan et al., 2020)", these citations refer to the software packages themselves and do not give the specific versions of JAX or Haiku used in the experiments. No other software dependencies with version numbers are listed. (A hedged version-pinning example follows the table.) |
| Experiment Setup | Yes | Appendix A.3, Table 3 lists the hyperparameters used across all experiments, giving specific values for batch size, discount, learning-rate schedule parameters (warmup steps, decay rate), Adam optimizer parameters (epsilon, beta1, beta2, weight decay), observation stack, LSTM hidden size, number of simulations, and the various loss coefficients. (A config sketch of this hyperparameter set follows the table.) |
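
Since the paper gives no pseudocode, the following is a minimal, illustrative sketch of the sampled action-selection step that Section 5 describes in prose: sample K actions from the policy prior β and search only over those, with the corrected prior (β̂/β)·π. This is not the authors' code; the function and parameter names (`sample_actions`, `prior_logits`, `K`) are placeholders of our own, and the sketch assumes the common choice β = π.

```python
# Hedged sketch of Sampled MuZero's sampled action-selection step.
# Assumes beta = pi (sampling directly from the policy prior).
import numpy as np

def sample_actions(prior_logits, K, rng):
    """Sample K actions (with replacement) from the policy prior pi."""
    pi = np.exp(prior_logits - prior_logits.max())
    pi /= pi.sum()
    actions = rng.choice(len(pi), size=K, p=pi)
    # Empirical distribution beta_hat over the sampled actions.
    unique, counts = np.unique(actions, return_counts=True)
    beta_hat = counts / K
    # Corrected prior (beta_hat / beta) * pi restricted to the samples;
    # with beta = pi this reduces to beta_hat itself.
    return unique, beta_hat

# Usage: the tree search then expands only the K sampled actions, and the
# improved policy target is the visit-count distribution over `unique`.
rng = np.random.default_rng(0)
sampled, corrected_prior = sample_actions(np.zeros(100), K=20, rng=rng)
print(sampled, corrected_prior)
```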
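
Because the paper names JAX and Haiku without version numbers, a reproducer would have to record versions themselves. A trivial way to do so, assuming both packages are installed (the paper does not state which versions were used):

```python
# Record the library versions of the local environment; the paper itself
# does not specify which versions were used in the experiments.
import jax
import haiku

print("jax", jax.__version__)
print("haiku", haiku.__version__)
```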
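
To show the shape of the experiment setup that Appendix A.3, Table 3 documents, here is a hedged config sketch. Every value below is a placeholder, not a number from the paper; only the field names follow the hyperparameters the table is reported to contain.

```python
# Illustrative container for the hyperparameters enumerated in Appendix A.3,
# Table 3. ALL values are placeholders, not the paper's actual settings.
from dataclasses import dataclass

@dataclass
class SampledMuZeroConfig:
    batch_size: int = 1024          # placeholder
    discount: float = 0.997         # placeholder
    lr_warmup_steps: int = 1000     # placeholder
    lr_decay_rate: float = 0.1      # placeholder
    adam_epsilon: float = 1e-8      # placeholder
    adam_beta1: float = 0.9         # placeholder
    adam_beta2: float = 0.999       # placeholder
    weight_decay: float = 1e-4      # placeholder
    observation_stack: int = 8      # placeholder
    lstm_hidden_size: int = 512     # placeholder
    num_simulations: int = 50       # placeholder
    # Loss coefficients from Table 3 would follow the same pattern.
```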