Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search
Authors: Lars Buesing, Théophane Weber, Yori Zwols, Nicolas Heess, Sébastien Racanière, Arthur Guez, Jean-Baptiste Lespiau
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. |
| Researcher Affiliation | Industry | Lars Buesing, Théophane Weber, Yori Zwols, Sébastien Racanière, Arthur Guez, Jean-Baptiste Lespiau, Nicolas Heess (DeepMind) lbuesing@google.com |
| Pseudocode | Yes | Algorithm 1 Counterfactual policy evaluation and search (a hedged sketch of this evaluation procedure follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code. |
| Open Datasets | No | The paper mentions using the PO-SOKOBAN environment but does not provide access information (link, DOI, citation with author/year) for a publicly available dataset. It describes how initial states are 'generated randomly by a generator algorithm' and that 'data ĥ^i_T was collected under a uniform random policy µ'. |
| Dataset Splits | No | The paper does not provide specific dataset split information (percentages, sample counts, or citations to predefined splits) needed to reproduce the data partitioning into train, validation, and test sets. It mentions 'training a separate model for each t ∈ {0, 5, 10, 20, 30, 40, 50}' and using '> 10^5 levels u_i from the inferred model' for policy evaluations, but no clear splits are defined. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. It mentions 'computational power' but no specifications. |
| Software Dependencies | No | The paper mentions using the ADAM optimizer (Kingma & Ba, 2014) and the reparametrization trick (Kingma & Welling, 2013; Rezende et al., 2014), as well as a 'convolutional LSTM model'. However, it does not specify version numbers for these software components or libraries. |
| Experiment Setup | Yes | The policy is parameterized as a deep, recurrent neural network consisting of a 3-layer deep convolutional LSTM (Xingjian et al., 2015) with 32 channels per layer and kernel size of 3. ... The model (together with the backward RNN) was trained with the ADAM optimizer (Kingma & Ba, 2014) on the ELBO loss using the reparametrization trick (Kingma & Welling, 2013; Rezende et al., 2014). The mini-batch size was set to 4 and the learning rate to 3e-4. (A hedged reconstruction of this setup follows the table.) |
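
The Pseudocode row quotes Algorithm 1 (counterfactual policy evaluation and search). The sketch below illustrates the abduct-act-predict pattern that counterfactual policy evaluation follows: infer the exogenous noise ("scenario") consistent with logged off-policy data, then replay that scenario under the candidate policy. The `scm.infer_noise`, `scm.initial_state`, `scm.step`, and `scm.reward` interfaces are hypothetical stand-ins, not the paper's actual model code, and the rollout wiring is a minimal assumption rather than a faithful reimplementation of Algorithm 1.

```python
"""Minimal sketch of counterfactual policy evaluation (abduct-act-predict).
All `scm.*` methods are hypothetical placeholders for a learned structural
causal model of the environment."""
import numpy as np


def counterfactual_policy_evaluation(pi, logged_episodes, scm, horizon, n_samples=10):
    """Estimate the value of candidate policy `pi` by replaying inferred scenarios.

    pi: callable state -> action (policy to evaluate)
    logged_episodes: trajectories collected under a behaviour policy mu
    scm: structural causal model exposing
         - scm.infer_noise(episode): sample noise u ~ p(u | episode)  (abduction)
         - scm.initial_state(u): initial state implied by scenario u
         - scm.step(state, action, u, t): deterministic transition given u
         - scm.reward(state, action): reward function
    """
    returns = []
    for episode in logged_episodes:
        for _ in range(n_samples):
            # 1) Abduction: infer a scenario consistent with the logged episode.
            u = scm.infer_noise(episode)
            # 2) Action: swap the behaviour policy mu for the candidate pi.
            # 3) Prediction: roll the SCM forward under pi with the same scenario.
            state = scm.initial_state(u)
            total_reward = 0.0
            for t in range(horizon):
                action = pi(state)
                total_reward += scm.reward(state, action)
                state = scm.step(state, action, u, t)
            returns.append(total_reward)
    return float(np.mean(returns))
```

Conditioning the rollouts on scenarios inferred from real data is what the paper argues gives counterfactual evaluation an advantage over rolling out an unconditional model of the environment.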
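The Software Dependencies and Experiment Setup rows quote a 3-layer convolutional LSTM (32 channels per layer, kernel size 3) trained with the ADAM optimizer on an ELBO loss via the reparametrization trick, with mini-batch size 4 and learning rate 3e-4. The sketch below reconstructs that configuration under stated assumptions: PyTorch is used only for illustration (the paper gives no library versions), the ConvLSTM cell is a generic minimal implementation, and the latent head, decoder, loss wiring, and input shape are hypothetical; only the depth, channel count, kernel size, and optimizer settings come from the quoted text.

```python
"""Hedged sketch of the quoted training setup; not the authors' code."""
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    """Generic convolutional LSTM cell; all four gates from one convolution."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class ConvLSTMModel(nn.Module):
    """3 stacked ConvLSTM layers, 32 channels each, kernel size 3 (as quoted)."""
    def __init__(self, in_ch=3, hid_ch=32, layers=3, z_ch=8):
        super().__init__()
        self.hid_ch = hid_ch
        chans = [in_ch] + [hid_ch] * layers
        self.cells = nn.ModuleList(ConvLSTMCell(chans[i], chans[i + 1]) for i in range(layers))
        # Hypothetical latent head and decoder, only to illustrate the ELBO.
        self.to_mu = nn.Conv2d(hid_ch, z_ch, 1)
        self.to_logvar = nn.Conv2d(hid_ch, z_ch, 1)
        self.decode = nn.Conv2d(z_ch, in_ch, 1)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t, _, hgt, wid = frames.shape
        states = [(frames.new_zeros(b, self.hid_ch, hgt, wid),
                   frames.new_zeros(b, self.hid_ch, hgt, wid)) for _ in self.cells]
        neg_elbo = frames.new_zeros(())
        for step in range(t):
            x = frames[:, step]
            for i, cell in enumerate(self.cells):
                states[i] = cell(x, states[i])
                x = states[i][0]
            mu, logvar = self.to_mu(x), self.to_logvar(x)
            # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I).
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            recon = self.decode(z)
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
            neg_elbo = neg_elbo + F.mse_loss(recon, frames[:, step]) + kl
        return neg_elbo  # negative ELBO up to constants


model = ConvLSTMModel()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # quoted learning rate
batch = torch.randn(4, 10, 3, 10, 10)  # mini-batch size 4 as quoted; 10x10 grids assumed
loss = model(batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The 10x10 grid size, three input channels, and ten-step sequences are placeholder shapes for a Sokoban-like observation; they are not reported in the quoted excerpts.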