Learning and Planning in Complex Action Spaces
Authors: Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, David Silver
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate this approach on the classical board game of Go and on two continuous control benchmark domains: DeepMind Control Suite and Real-World RL Suite. |
| Researcher Affiliation | Industry | DeepMind, London, UK. Correspondence to: Thomas Hubert <tkhubert@google.com>. |
| Pseudocode | No | The algorithm, Sampled MuZero, is described in prose in Section 5, which details its modifications to MuZero, but no formal pseudocode block or algorithm listing is given (a hedged sketch of the core idea follows this table). |
| Open Source Code | No | The paper does not provide a direct link or explicit statement about the public availability of its source code. |
| Open Datasets | Yes | To demonstrate the generality of this approach, we apply our algorithm to two continuous control benchmark domains, the DeepMind Control Suite (Tassa et al., 2018) and Real-World RL Suite (Dulac-Arnold et al., 2020). |
| Dataset Splits | No | The paper mentions using '3 seeds per experiment' and refers to 'data budgets' and 'task classification' from other papers but does not provide specific train/validation/test dataset split information (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | No | The paper does not specify the CPU, GPU models, memory, or any other specific hardware used for running the experiments. |
| Software Dependencies | No | The paper states 'All models are implemented in JAX (Bradbury et al., 2018) using Haiku (Hennigan et al., 2020)', but these citations identify the software packages themselves and do not give the specific versions of JAX or Haiku used in the experiments. No other software dependencies with version numbers are listed (a version-logging snippet follows this table). |
| Experiment Setup | Yes | Appendix A.3, Table 3 lists all hyperparameters used across all experiments, providing specific values for batch size, discount, learning rate schedule parameters (warmup steps, decay rate), Adam optimizer parameters (epsilon, beta1, beta2, weight decay), observation stack, LSTM hidden size, number of simulations, and various loss coefficients. |
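
Since the paper describes Sampled MuZero only in prose (see the Pseudocode row above), here is a minimal sketch of its central idea: sample K actions from a proposal distribution β (assumed here to be the network's policy prior π, the paper's default choice), restrict the tree search to that sampled subset, and use the importance-corrected prior (β̂/β)·π as the search prior, where β̂ is the empirical distribution of the samples. The function and variable names below are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def sampled_search_prior(prior_logits, num_samples=20, rng=None):
    """Hedged sketch of Sampled MuZero's action-sampling step.

    With the proposal beta taken equal to the policy prior pi, the
    corrected search prior (beta_hat / beta) * pi reduces to beta_hat,
    the empirical distribution of the sampled actions.
    """
    if rng is None:
        rng = np.random.default_rng()

    # Policy prior over the full (possibly very large) action space.
    pi = np.exp(prior_logits - prior_logits.max())
    pi /= pi.sum()

    # Draw K actions (with replacement) from the proposal beta = pi.
    samples = rng.choice(len(pi), size=num_samples, p=pi)
    actions, counts = np.unique(samples, return_counts=True)

    beta_hat = counts / num_samples   # empirical sample distribution
    beta = pi[actions]                # proposal probability of each sampled action
    corrected = (beta_hat / beta) * pi[actions]
    corrected /= corrected.sum()      # search prior over the sampled subset only
    return actions, corrected
```

The search then runs MCTS over `actions` only, using `corrected` in place of the full prior, and the policy training target becomes the normalized visit-count distribution over this sampled subset.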
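
Relatedly, because the Software Dependencies row notes that no versions are pinned, a reproduction should at least record the library versions it actually runs under. A minimal snippet, assuming standard `jax` and `dm-haiku` installations:

```python
import jax
import haiku as hk

# Log exact versions alongside experiment results; the paper cites
# JAX and Haiku but does not state which versions were used.
print(f"jax=={jax.__version__}, haiku=={hk.__version__}")
```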