An Investigation of Model-Free Planning
Authors: Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sebastien Racaniere, Theophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, Timothy Lillicrap
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning. |
| Researcher Affiliation | Industry | *Equal contribution 1DeepMind, London, UK. Correspondence to: <{aguez, mmirza, rkabra, countzero}@google.com>. |
| Pseudocode | No | The network $f_\theta$ is then repeated $N$ times within each time-step (i.e., multiple internal ticks per real time-step). If $s_{t-1}$ is the state at the end of the previous time-step, we obtain the new state given the input $i_t$ as: $s_t = g_\theta(s_{t-1}, i_t) = \underbrace{f_\theta(f_\theta(\ldots f_\theta(s_{t-1}, i_t), \ldots, i_t), i_t)}_{N \text{ times}}$ (1) |
| Open Source Code | No | We are releasing these levels as datasets in the standard Sokoban format1. 1https://github.com/deepmind/boxoban-levels |
| Open Datasets | Yes | Sokoban A difficult puzzle domain requiring an agent to push a set of boxes onto goal locations (Botea et al., 2003; Racanière et al., 2017). ... We are releasing these levels as datasets in the standard Sokoban format1. 1https://github.com/deepmind/boxoban-levels |
| Dataset Splits | Yes | We either train on a Large (900k levels), Medium-size (10k) or Small (1k) set, all subsets of the Sokoban-unfiltered training set. ... Figures 5a-b compare these same trained models when tested on both the unfiltered and on the medium(-difficulty) test sets. |
| Hardware Specification | No | No specific hardware details (GPU models, CPU models, memory, etc.) were mentioned in the paper. |
| Software Dependencies | No | More specifically, we used a distributed framework and the IMPALA V-trace actor-critic algorithm (Espeholt et al., 2018). While we found this training regime to help for training networks with more parameters, we also ran experiments which demonstrate that the DRC architecture can be trained effectively with A3C (Mnih et al., 2016). |
| Experiment Setup | Yes | More details on the setup can be found in Appendix 9.2. |
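The recurrence in Eq. (1) quoted above applies the same network $f_\theta$ to the state $N$ times per environment step, feeding the identical input $i_t$ at every internal tick. A minimal sketch of that repeated-tick computation (the function names and the toy `f` are illustrative assumptions; in the paper the core is a ConvLSTM, not shown here):

```python
def repeated_tick(f_theta, s_prev, inp, n_ticks):
    """Sketch of Eq. (1): s_t = g_theta(s_{t-1}, i_t), obtained by
    applying f_theta n_ticks times while re-feeding the same input i_t."""
    s = s_prev
    for _ in range(n_ticks):
        s = f_theta(s, inp)  # same input at every internal tick
    return s

# Toy stand-in for f_theta, purely to show the unrolling structure.
f = lambda s, i: s + i
print(repeated_tick(f, 0, 2, 3))  # 0 -> 2 -> 4 -> 6
```

The point of the structure is that extra internal ticks give the agent more computation per real time-step without advancing the environment, which is how the paper probes "additional thinking time."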