On the role of planning in model-based deep reinforcement learning

Authors: Jessica B Hamrick, Abram L. Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Holger Buesing, Petar Veličković, Theophane Weber

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the performance of MuZero [58], a state-of-the-art MBRL algorithm with strong connections and overlapping components with many other MBRL algorithms. We perform a number of interventions and ablations of MuZero across a wide range of environments, including control tasks, Atari, and 9x9 Go. Our results suggest the following: (1) Planning is most useful in the learning process, both for policy updates and for providing a more useful data distribution. (2) Using shallow trees with simple Monte-Carlo rollouts is as performant as more complex methods, except in the most difficult reasoning tasks. (3) Planning alone is insufficient to drive strong generalization.
Researcher Affiliation | Industry | Jessica B. Hamrick, Abram L. Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Buesing, Petar Veličković, Théophane Weber. DeepMind, London, UK. Correspondence addressed to: {jhamrick,theophane}@google.com
Pseudocode | Yes | Algorithm 1 (MuZero [58]) and Algorithm 2 (MCTS in MuZero). (An illustrative MCTS sketch follows the table.)
Open Source Code | No | The paper does not contain an explicit statement about the release of its source code, nor does it provide a link to a code repository for the methodology described.
Open Datasets | Yes | We evaluate overall reward obtained by MuZero across a wide range of standard MBRL environments: the DeepMind Control Suite [70], Atari [8], Sokoban [51], Minipacman [22], and 9x9 Go [42]. (An environment-loading sketch follows the table.)
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits with percentages or counts. For Minipacman, it mentions training on '5, 10, or 100 unique mazes' and testing on 'new mazes', which describes a generalization protocol rather than a standard train/validation/test split of a fixed dataset. (A maze-split sketch follows the table.)
Hardware Specification | Yes | All Minipacman experiments were run using 400 CPU-based actors and 1 NVIDIA V100 for the learner. All Atari experiments were run using 1024 CPU-based actors and 4 NVIDIA V100s for the learner. All Control Suite experiments were run using 1024 CPU-based actors and 2 second-generation (v2) Tensor Processing Units (TPUs) for the learner.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, TensorFlow, PyTorch versions). It mentions MuJoCo for the Control Suite but not its version.
Experiment Setup | Yes | Table 1: Shared hyperparameters... Table 6: Hyperparameters for Go. These tables list specific values for the learning rate, discount factor, batch size, n-step return length, replay samples-per-insert ratio, learner steps, policy loss weight, value loss weight, number of simulations, etc. (A configuration sketch follows the table.)
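
The Pseudocode row above refers to the paper's Algorithm 1 (MuZero) and Algorithm 2 (MCTS in MuZero). Below is a minimal, illustrative sketch of MCTS with a learned model in that spirit; it is not the paper's implementation. The `representation`, `dynamics`, and `prediction` arguments are stand-ins for MuZero's learned networks, the pUCT constants are placeholders, and discounting and exploration noise are omitted for brevity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    prior: float
    hidden_state: object = None
    reward: float = 0.0
    visit_count: int = 0
    value_sum: float = 0.0
    children: dict = field(default_factory=dict)  # action -> Node

    def value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def puct_score(parent: Node, child: Node, c1: float = 1.25, c2: float = 19652.0) -> float:
    # pUCT-style rule: exploit the child's value, explore via its prior and visit counts.
    pb_c = c1 + math.log((parent.visit_count + c2 + 1) / c2)
    pb_c *= math.sqrt(parent.visit_count) / (child.visit_count + 1)
    return child.value() + pb_c * child.prior


def run_mcts(root_observation, num_simulations, num_actions,
             representation, dynamics, prediction):
    """representation(obs) -> hidden state;
    dynamics(state, action) -> (next state, reward);
    prediction(state) -> (per-action priors, value)."""
    root = Node(prior=1.0, hidden_state=representation(root_observation))
    priors, _ = prediction(root.hidden_state)
    root.children = {a: Node(prior=priors[a]) for a in range(num_actions)}

    for _ in range(num_simulations):
        node, path = root, [root]
        # Selection: walk down the tree until an unexpanded child is reached.
        while node.children:
            action, node = max(node.children.items(),
                               key=lambda kv: puct_score(path[-1], kv[1]))
            path.append(node)
        # Expansion: unroll the learned dynamics one step from the leaf's parent.
        parent = path[-2]
        state, reward = dynamics(parent.hidden_state, action)
        priors, value = prediction(state)
        node.hidden_state, node.reward = state, reward
        node.children = {a: Node(prior=priors[a]) for a in range(num_actions)}
        # Backup: propagate the leaf value along the search path (no discounting here).
        for n in reversed(path):
            n.value_sum += value
            n.visit_count += 1
    return root


# Toy stubs so the sketch runs end to end (purely illustrative, not a learned model).
if __name__ == "__main__":
    n_actions = 4
    rep = lambda obs: obs
    dyn = lambda s, a: (s, 0.0)
    pred = lambda s: ([1.0 / n_actions] * n_actions, 0.0)
    tree = run_mcts(root_observation=0, num_simulations=50, num_actions=n_actions,
                    representation=rep, dynamics=dyn, prediction=pred)
    print({a: child.visit_count for a, child in tree.children.items()})
```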
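
The Open Datasets row names publicly available environments. The snippet below shows how two of them are commonly loaded; the paper does not state which bindings it used, so `dm_control` and Gym's Atari wrapper here are assumptions rather than the authors' setup.

```python
from dm_control import suite  # DeepMind Control Suite
import gym                    # Atari via the Arcade Learning Environment (atari extras required)

# A Control Suite task, e.g. cartpole swingup.
control_env = suite.load(domain_name="cartpole", task_name="swingup")
timestep = control_env.reset()
print(timestep.observation)

# An Atari game; the specific game/version string is only an example.
atari_env = gym.make("PongNoFrameskip-v4")
obs = atari_env.reset()
print(atari_env.action_space)
```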
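
The Dataset Splits row describes a generalization protocol for Minipacman: train on a small fixed set of mazes, evaluate on unseen ones. The sketch below illustrates that kind of held-out split; `generate_maze` and the test-set size of 20 are hypothetical, since no maze generator or test count is given in the quoted text.

```python
import random


def generate_maze(seed: int):
    """Hypothetical deterministic maze generator keyed by a seed (placeholder only)."""
    return f"maze-{seed}"


def make_split(num_train_mazes: int, num_test_mazes: int, seed: int = 0):
    # Draw disjoint seeds so training and evaluation mazes never overlap.
    rng = random.Random(seed)
    seeds = rng.sample(range(10_000), num_train_mazes + num_test_mazes)
    train = [generate_maze(s) for s in seeds[:num_train_mazes]]
    test = [generate_maze(s) for s in seeds[num_train_mazes:]]
    return train, test


# The 5-, 10-, and 100-maze training conditions mentioned in the row above;
# 20 evaluation mazes is an arbitrary choice for this illustration.
for n in (5, 10, 100):
    train_mazes, test_mazes = make_split(n, num_test_mazes=20)
    assert not set(train_mazes) & set(test_mazes)
```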
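
The Experiment Setup row points to Tables 1-6 of the paper for concrete hyperparameter values. The dataclass below only mirrors the hyperparameter names quoted in that row; it deliberately carries no values, since those belong to the paper's tables.

```python
from dataclasses import dataclass


@dataclass
class MuZeroExperimentConfig:
    # Field names follow the hyperparameters quoted in the Experiment Setup row;
    # values must come from the paper's Tables 1-6 and are not reproduced here.
    learning_rate: float
    discount_factor: float
    batch_size: int
    n_step_return_length: int
    replay_samples_per_insert: float
    learner_steps: int
    policy_loss_weight: float
    value_loss_weight: float
    num_simulations: int
```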