An Investigation of Model-Free Planning

Authors: Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, Timothy Lillicrap

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.
Researcher Affiliation | Industry | *Equal contribution. DeepMind, London, UK. Correspondence to: <{aguez, mmirza, rkabra, countzero}@google.com>.
Pseudocode | No | The network $f_\theta$ is then repeated $N$ times within each time-step (i.e., multiple internal ticks per real time-step). If $s_{t-1}$ is the state at the end of the previous time-step, we obtain the new state given the input $i_t$ as: $s_t = g_\theta(s_{t-1}, i_t) = \underbrace{f_\theta(f_\theta(\dots f_\theta(s_{t-1}, i_t), \dots, i_t), i_t)}_{N \text{ times}}$ (1). (A minimal sketch of this repeated-tick recurrence appears below the table.)
Open Source Code | No | We are releasing these levels as datasets in the standard Sokoban format (https://github.com/deepmind/boxoban-levels).
Open Datasets | Yes | Sokoban: a difficult puzzle domain requiring an agent to push a set of boxes onto goal locations (Botea et al., 2003; Racanière et al., 2017). ... We are releasing these levels as datasets in the standard Sokoban format (https://github.com/deepmind/boxoban-levels). (A sketch for loading levels in this format appears below the table.)
Dataset Splits | Yes | We either train on a Large (900k levels), Medium-size (10k), or Small (1k) set, all subsets of the Sokoban-unfiltered training set. ... Figures 5a-b compare these same trained models when tested on both the unfiltered and the medium(-difficulty) test sets.
Hardware Specification | No | No specific hardware details (GPU models, CPU models, memory, etc.) were mentioned in the paper.
Software Dependencies | No | More specifically, we used a distributed framework and the IMPALA V-trace actor-critic algorithm (Espeholt et al., 2018). While we found this training regime to help for training networks with more parameters, we also ran experiments which demonstrate that the DRC architecture can be trained effectively with A3C (Mnih et al., 2016). (A sketch of the V-trace target computation appears below the table.)
Experiment Setup | Yes | More details on the setup can be found in Appendix 9.2.
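
For concreteness, here is a minimal sketch of the repeated-tick recurrence in Equation (1) quoted in the Pseudocode row: the core network f_theta is applied N times within one real time-step, receiving the same input i_t at every internal tick. The tanh stand-in for f_theta is purely illustrative (the paper's DRC uses a stacked ConvLSTM core); names and shapes here are my own, not the authors' implementation.

```python
import numpy as np

def f_theta(state: np.ndarray, inp: np.ndarray) -> np.ndarray:
    """Stand-in for the per-tick core f_theta. The paper uses a stacked
    ConvLSTM; a single tanh update here is purely illustrative."""
    return np.tanh(state + inp)

def g_theta(state: np.ndarray, inp: np.ndarray, n_ticks: int) -> np.ndarray:
    """Equation (1): apply f_theta N times within one real time-step,
    feeding the same input i_t at every internal tick."""
    for _ in range(n_ticks):
        state = f_theta(state, inp)
    return state

s_prev = np.zeros(8)                    # s_{t-1}: state from previous step
i_t = np.random.randn(8)                # encoded observation for step t
s_t = g_theta(s_prev, i_t, n_ticks=3)   # N = 3 internal ticks
```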
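The Open Datasets row points to the boxoban-levels repository, which stores levels in the standard ASCII Sokoban encoding ('#' wall, '@' agent, '$' box, '.' goal, ' ' floor). A minimal loader sketch follows; the '; <id>' header convention between levels and the example file path are assumptions about the repository layout, not documented API.

```python
from pathlib import Path

def load_levels(path: str) -> list[list[str]]:
    """Parse an ASCII Sokoban level file into a list of grids (one list
    of row strings per level). Assumes each level is preceded by a
    '; <id>' header line, as in the boxoban-levels files."""
    levels, current = [], []
    for line in Path(path).read_text().splitlines():
        if line.startswith(";"):      # header line starts a new level
            if current:
                levels.append(current)
            current = []
        elif line.strip():            # grid row ('#', '@', '$', '.', ' ')
            current.append(line)
    if current:
        levels.append(current)
    return levels

# Hypothetical path into a local clone of deepmind/boxoban-levels.
levels = load_levels("boxoban-levels/unfiltered/train/000.txt")
print(len(levels), "levels loaded; first level:")
print("\n".join(levels[0]))
```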
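The Software Dependencies row names the IMPALA V-trace actor-critic (Espeholt et al., 2018). As a reference point, here is a sketch of the V-trace value-target computation from that paper, written with my own function name and array layout; it omits episode-termination masking and fixes the lambda mixing parameter to 1, so it is a simplified illustration rather than the authors' training code.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets v_s (Espeholt et al., 2018), computed backwards.

    rewards, log_rhos: length-T arrays; values: length-T array of V(x_t);
    bootstrap_value: V(x_T); log_rhos = log(pi(a|x) / mu(a|x)).
    """
    ratios = np.exp(log_rhos)
    rhos = np.minimum(rho_bar, ratios)   # truncated importance weights
    cs = np.minimum(c_bar, ratios)       # truncated trace coefficients
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # v_s - V(x_s) satisfies: a_s = delta_s + gamma * c_s * a_{s+1}
    vs_minus_v = np.zeros(len(values))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v           # v_s targets for the critic
```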