An Investigation of Model-Free Planning

Authors: Arthur Guez, Mehdi Mirza, Karol Gregor, Rishabh Kabra, Sébastien Racanière, Théophane Weber, David Raposo, Adam Santoro, Laurent Orseau, Tom Eccles, Greg Wayne, David Silver, Timothy Lillicrap

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We measure our agent's effectiveness at planning in terms of its ability to generalize across a combinatorial and irreversible state space, its data efficiency, and its ability to utilize additional thinking time. We find that our agent has many of the characteristics that one might expect to find in a planning algorithm. Furthermore, it exceeds the state-of-the-art in challenging combinatorial domains such as Sokoban and outperforms other model-free approaches that utilize strong inductive biases toward planning.
Researcher Affiliation | Industry | *Equal contribution. DeepMind, London, UK. Correspondence to: <{aguez, mmirza, rkabra, countzero}@google.com>.
Pseudocode | No | The network $f_\theta$ is then repeated $N$ times within each time-step (i.e., multiple internal ticks per real time-step). If $s_{t-1}$ is the state at the end of the previous time-step, we obtain the new state given the input $i_t$ as: $s_t = g_\theta(s_{t-1}, i_t) = \underbrace{f_\theta(f_\theta(\dots f_\theta(s_{t-1}, i_t), \dots, i_t), i_t)}_{N \text{ times}}$ (1). (A minimal sketch of this repeated-tick recurrence appears below the table.)
Open Source Code | No | We are releasing these levels as datasets in the standard Sokoban format (https://github.com/deepmind/boxoban-levels).
Open Datasets | Yes | Sokoban: a difficult puzzle domain requiring an agent to push a set of boxes onto goal locations (Botea et al., 2003; Racanière et al., 2017). ... We are releasing these levels as datasets in the standard Sokoban format (https://github.com/deepmind/boxoban-levels). (A sketch for loading levels in this format appears below the table.)
Dataset Splits | Yes | We either train on a Large (900k levels), Medium-size (10k), or Small (1k) set, all subsets of the Sokoban-unfiltered training set. ... Figures 5a-b compare these same trained models when tested on both the unfiltered and the medium(-difficulty) test sets.
Hardware Specification | No | No specific hardware details (GPU models, CPU models, memory, etc.) were mentioned in the paper.
Software Dependencies | No | More specifically, we used a distributed framework and the IMPALA V-trace actor-critic algorithm (Espeholt et al., 2018). While we found this training regime to help for training networks with more parameters, we also ran experiments which demonstrate that the DRC architecture can be trained effectively with A3C (Mnih et al., 2016). (A sketch of the V-trace target computation appears below the table.)
Experiment Setup | Yes | More details on the setup can be found in Appendix 9.2.
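
For concreteness, here is a minimal sketch of the repeated-tick recurrence in Equation (1) quoted in the Pseudocode row: the core network f_theta is applied N times within one real time-step, receiving the same input i_t at every internal tick. The tanh stand-in for f_theta is purely illustrative (the paper's DRC uses a stacked ConvLSTM core); names and shapes here are my own, not the authors' implementation.

```python
import numpy as np

def f_theta(state: np.ndarray, inp: np.ndarray) -> np.ndarray:
    """Stand-in for the per-tick core f_theta. The paper uses a stacked
    ConvLSTM; a single tanh update here is purely illustrative."""
    return np.tanh(state + inp)

def g_theta(state: np.ndarray, inp: np.ndarray, n_ticks: int) -> np.ndarray:
    """Equation (1): apply f_theta N times within one real time-step,
    feeding the same input i_t at every internal tick."""
    for _ in range(n_ticks):
        state = f_theta(state, inp)
    return state

s_prev = np.zeros(8)                    # s_{t-1}: state from previous step
i_t = np.random.randn(8)                # encoded observation for step t
s_t = g_theta(s_prev, i_t, n_ticks=3)   # N = 3 internal ticks
```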
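The Open Datasets row points to the boxoban-levels repository, which stores levels in the standard ASCII Sokoban encoding ('#' wall, '@' agent, '$' box, '.' goal, ' ' floor). A minimal loader sketch follows; the '; <id>' header convention between levels and the example file path are assumptions about the repository layout, not documented API.

```python
from pathlib import Path

def load_levels(path: str) -> list[list[str]]:
    """Parse an ASCII Sokoban level file into a list of grids (one list
    of row strings per level). Assumes each level is preceded by a
    '; <id>' header line, as in the boxoban-levels files."""
    levels, current = [], []
    for line in Path(path).read_text().splitlines():
        if line.startswith(";"):      # header line starts a new level
            if current:
                levels.append(current)
            current = []
        elif line.strip():            # grid row ('#', '@', '$', '.', ' ')
            current.append(line)
    if current:
        levels.append(current)
    return levels

# Hypothetical path into a local clone of deepmind/boxoban-levels.
levels = load_levels("boxoban-levels/unfiltered/train/000.txt")
print(len(levels), "levels loaded; first level:")
print("\n".join(levels[0]))
```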
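The Software Dependencies row names the IMPALA V-trace actor-critic (Espeholt et al., 2018). As a reference point, here is a sketch of the V-trace value-target computation from that paper, written with my own function name and array layout; it omits episode-termination masking and fixes the lambda mixing parameter to 1, so it is a simplified illustration rather than the authors' training code.

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets v_s (Espeholt et al., 2018), computed backwards.

    rewards, log_rhos: length-T arrays; values: length-T array of V(x_t);
    bootstrap_value: V(x_T); log_rhos = log(pi(a|x) / mu(a|x)).
    """
    ratios = np.exp(log_rhos)
    rhos = np.minimum(rho_bar, ratios)   # truncated importance weights
    cs = np.minimum(c_bar, ratios)       # truncated trace coefficients
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # v_s - V(x_s) satisfies: a_s = delta_s + gamma * c_s * a_{s+1}
    vs_minus_v = np.zeros(len(values))
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v           # v_s targets for the critic
```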