On the role of planning in model-based deep reinforcement learning
Authors: Jessica B. Hamrick, Abram L. Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Holger Buesing, Petar Veličković, Theophane Weber
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the performance of MuZero [58], a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization. |
| Researcher Affiliation | Industry | Jessica B. Hamrick, Abram L. Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Buesing, Petar Veličković, Théophane Weber. DeepMind, London, UK. Correspondence addressed to: {jhamrick,theophane}@google.com |
| Pseudocode | Yes | Algorithm 1 (MuZero [58]) and Algorithm 2 (MCTS in MuZero); a minimal sketch of the MCTS planning loop is given below the table. |
| Open Source Code | No | The paper does not contain an explicit statement about the release of its source code, nor does it provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | We evaluate overall reward obtained by MuZero across a wide range of standard MBRL environments: the DeepMind Control Suite [70], Atari [8], Sokoban [51], Minipacman [22], and 9x9 Go [42]. |
| Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits with percentages or counts. For Minipacman, it mentions training on '5, 10, or 100 unique mazes' and testing on 'new mazes', which describes an experimental setup for generalization, but not a standard train/validation/test split for a fixed dataset. |
| Hardware Specification | Yes | All Minipacman experiments were run using 400 CPU-based actors and 1 NVIDIA V100 for the learner. All Atari experiments were run using 1024 CPU-based actors and 4 NVIDIA V100s for the learner. All Control Suite experiments were run using 1024 CPU-based actors and 2 second-generation (v2) Tensor Processing Units (TPUs) for the learner. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). It mentions MuJoCo for the Control Suite but not its version. |
| Experiment Setup | Yes | Table 1: Shared hyperparameters... Table 6: Hyperparameters for Go. These tables list specific values for learning rate, discount factor, batch size, n-step return length, replay samples-per-insert ratio, learner steps, policy loss weight, value loss weight, number of simulations, etc. (see the configuration sketch below the table). |
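
The Pseudocode row above points to Algorithm 1 (MuZero) and Algorithm 2 (MCTS in MuZero). For orientation, here is a minimal Python sketch of the kind of MCTS planning loop those algorithms describe. It is a sketch under stated assumptions, not the paper's implementation: the learned representation, dynamics, and prediction networks are replaced by random stub functions, and the pUCT constants, `num_simulations`, and `discount` defaults are illustrative placeholders rather than the paper's settings.

```python
# Minimal sketch of a MuZero-style MCTS planning loop (illustrative only).
# The learned networks are replaced by random stubs; hyperparameter values
# are placeholders, not the settings reported in the paper.
import math
import random

NUM_ACTIONS = 4
PB_C_BASE, PB_C_INIT = 19652, 1.25  # pUCT exploration constants (illustrative)


class Node:
    def __init__(self, prior):
        self.prior = prior           # P(a|s) from the prediction stub
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}           # action -> Node
        self.hidden_state = None
        self.reward = 0.0

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0


# --- Stand-ins for MuZero's learned functions (placeholders) ---------------
def representation(observation):
    return observation                        # h(o) -> latent state

def dynamics(state, action):
    return (state, action), random.random()   # g(s, a) -> next state, reward

def prediction(state):
    priors = [1.0 / NUM_ACTIONS] * NUM_ACTIONS
    return priors, random.uniform(-1, 1)      # f(s) -> policy prior, value
# ----------------------------------------------------------------------------


def ucb_score(parent, child):
    # pUCT rule: value estimate plus a prior-weighted exploration bonus.
    pb_c = math.log((parent.visit_count + PB_C_BASE + 1) / PB_C_BASE) + PB_C_INIT
    pb_c *= math.sqrt(parent.visit_count) / (child.visit_count + 1)
    return child.value() + pb_c * child.prior


def run_mcts(observation, num_simulations=50, discount=0.997):
    root = Node(prior=1.0)
    root.hidden_state = representation(observation)
    priors, _ = prediction(root.hidden_state)
    root.children = {a: Node(p) for a, p in enumerate(priors)}

    for _ in range(num_simulations):
        node, path = root, [root]
        # Selection: descend the tree with the pUCT rule until a leaf.
        while node.children:
            action, node = max(node.children.items(),
                               key=lambda item: ucb_score(path[-1], item[1]))
            path.append(node)
        # Expansion: unroll the (stubbed) model one step from the parent.
        parent = path[-2]
        node.hidden_state, node.reward = dynamics(parent.hidden_state, action)
        priors, value = prediction(node.hidden_state)
        node.children = {a: Node(p) for a, p in enumerate(priors)}
        # Backup: propagate the discounted value estimate to the root.
        for n in reversed(path):
            n.value_sum += value
            n.visit_count += 1
            value = n.reward + discount * value

    # Root visit counts define the search policy / action selection.
    return {a: child.visit_count for a, child in root.children.items()}


if __name__ == "__main__":
    print(run_mcts(observation=(0,)))
```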
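
Similarly, the Experiment Setup row lists the hyperparameters reported in Tables 1-6. The sketch below shows one way those quantities could be grouped into a single experiment configuration; the field names mirror the quantities named in that row, but every default value is a placeholder for illustration and should not be read as the paper's actual setting.

```python
# Sketch of an experiment config grouping the hyperparameters named in the
# Experiment Setup row. All default values are illustrative placeholders,
# NOT the values reported in Tables 1-6 of the paper.
from dataclasses import dataclass


@dataclass
class MuZeroExperimentConfig:
    learning_rate: float = 1e-4        # optimizer step size (placeholder)
    discount: float = 0.99             # return discount factor (placeholder)
    batch_size: int = 512              # learner batch size (placeholder)
    n_step: int = 5                    # n-step return length (placeholder)
    samples_per_insert: float = 1.0    # replay samples-per-insert ratio (placeholder)
    learner_steps: int = 1_000_000     # total gradient updates (placeholder)
    policy_loss_weight: float = 1.0    # weight on the policy loss term (placeholder)
    value_loss_weight: float = 1.0     # weight on the value loss term (placeholder)
    num_simulations: int = 50          # MCTS simulations per search (placeholder)


# Per-environment overrides would play the role of the paper's
# environment-specific tables (values again purely illustrative).
atari_config = MuZeroExperimentConfig(batch_size=1024, num_simulations=50)
```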