Woulda, Coulda, Shoulda: Counterfactually-Guided Policy Search
Authors: Lars Buesing, Théophane Weber, Yori Zwols, Nicolas Heess, Sébastien Racanière, Arthur Guez, Jean-Baptiste Lespiau
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find empirically that these advantages translate into improved policy evaluation and search results on a non-trivial grid-world task. |
| Researcher Affiliation | Industry | Lars Buesing, Théophane Weber, Yori Zwols, Sébastien Racanière, Arthur Guez, Jean-Baptiste Lespiau, Nicolas Heess (DeepMind) lbuesing@google.com |
| Pseudocode | Yes | Algorithm 1 Counterfactual policy evaluation and search (a hedged sketch of this evaluation procedure follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link for open-source code. |
| Open Datasets | No | The paper mentions using the PO-SOKOBAN environment but does not provide access information (link, DOI, citation with author/year) for a publicly available dataset. It describes how initial states are 'generated randomly by a generator algorithm' and that 'data ĥ^i_T was collected under a uniform random policy µ'. |
| Dataset Splits | No | The paper does not provide specific dataset split information (percentages, sample counts, or citations to predefined splits) needed to reproduce the data partitioning into train, validation, and test sets. It mentions 'training a separate model for each t ∈ {0, 5, 10, 20, 30, 40, 50}' and using '> 10^5 levels u_i from the inferred model' for policy evaluations, but no clear splits are defined. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. It mentions 'computational power' but no specifications. |
| Software Dependencies | No | The paper mentions using the ADAM optimizer (Kingma & Ba, 2014) and the reparametrization trick (Kingma & Welling, 2013; Rezende et al., 2014), as well as a 'convolutional LSTM model'. However, it does not specify version numbers for these software components or libraries. |
| Experiment Setup | Yes | The policy is parameterized as a deep, recurrent neural network consisting of a 3-layer deep convolutional LSTM (Xingjian et al., 2015) with 32 channels per layer and kernel size of 3. ... The model (together with the backward RNN) was trained with the ADAM optimizer (Kingma & Ba, 2014) on the ELBO loss using the reparametrization trick (Kingma & Welling, 2013; Rezende et al., 2014). The mini-batch size was set to 4 and the learning rate to 3e-4. (A hedged reconstruction of this setup follows the table.) |
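
The Pseudocode row quotes Algorithm 1 (counterfactual policy evaluation and search). The sketch below illustrates the abduct-act-predict pattern that counterfactual policy evaluation follows: infer the exogenous noise ("scenario") consistent with logged off-policy data, then replay that scenario under the candidate policy. The `scm.infer_noise`, `scm.initial_state`, `scm.step`, and `scm.reward` interfaces are hypothetical stand-ins, not the paper's actual model code, and the rollout wiring is a minimal assumption rather than a faithful reimplementation of Algorithm 1.

```python
"""Minimal sketch of counterfactual policy evaluation (abduct-act-predict).
All `scm.*` methods are hypothetical placeholders for a learned structural
causal model of the environment."""
import numpy as np


def counterfactual_policy_evaluation(pi, logged_episodes, scm, horizon, n_samples=10):
    """Estimate the value of candidate policy `pi` by replaying inferred scenarios.

    pi: callable state -> action (policy to evaluate)
    logged_episodes: trajectories collected under a behaviour policy mu
    scm: structural causal model exposing
         - scm.infer_noise(episode): sample noise u ~ p(u | episode)  (abduction)
         - scm.initial_state(u): initial state implied by scenario u
         - scm.step(state, action, u, t): deterministic transition given u
         - scm.reward(state, action): reward function
    """
    returns = []
    for episode in logged_episodes:
        for _ in range(n_samples):
            # 1) Abduction: infer a scenario consistent with the logged episode.
            u = scm.infer_noise(episode)
            # 2) Action: swap the behaviour policy mu for the candidate pi.
            # 3) Prediction: roll the SCM forward under pi with the same scenario.
            state = scm.initial_state(u)
            total_reward = 0.0
            for t in range(horizon):
                action = pi(state)
                total_reward += scm.reward(state, action)
                state = scm.step(state, action, u, t)
            returns.append(total_reward)
    return float(np.mean(returns))
```

Conditioning the rollouts on scenarios inferred from real data is what the paper argues gives counterfactual evaluation an advantage over rolling out an unconditional model of the environment.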
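The Software Dependencies and Experiment Setup rows quote a 3-layer convolutional LSTM (32 channels per layer, kernel size 3) trained with the ADAM optimizer on an ELBO loss via the reparametrization trick, with mini-batch size 4 and learning rate 3e-4. The sketch below reconstructs that configuration under stated assumptions: PyTorch is used only for illustration (the paper gives no library versions), the ConvLSTM cell is a generic minimal implementation, and the latent head, decoder, loss wiring, and input shape are hypothetical; only the depth, channel count, kernel size, and optimizer settings come from the quoted text.

```python
"""Hedged sketch of the quoted training setup; not the authors' code."""
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    """Generic convolutional LSTM cell; all four gates from one convolution."""
    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel, padding=kernel // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c


class ConvLSTMModel(nn.Module):
    """3 stacked ConvLSTM layers, 32 channels each, kernel size 3 (as quoted)."""
    def __init__(self, in_ch=3, hid_ch=32, layers=3, z_ch=8):
        super().__init__()
        self.hid_ch = hid_ch
        chans = [in_ch] + [hid_ch] * layers
        self.cells = nn.ModuleList(ConvLSTMCell(chans[i], chans[i + 1]) for i in range(layers))
        # Hypothetical latent head and decoder, only to illustrate the ELBO.
        self.to_mu = nn.Conv2d(hid_ch, z_ch, 1)
        self.to_logvar = nn.Conv2d(hid_ch, z_ch, 1)
        self.decode = nn.Conv2d(z_ch, in_ch, 1)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t, _, hgt, wid = frames.shape
        states = [(frames.new_zeros(b, self.hid_ch, hgt, wid),
                   frames.new_zeros(b, self.hid_ch, hgt, wid)) for _ in self.cells]
        neg_elbo = frames.new_zeros(())
        for step in range(t):
            x = frames[:, step]
            for i, cell in enumerate(self.cells):
                states[i] = cell(x, states[i])
                x = states[i][0]
            mu, logvar = self.to_mu(x), self.to_logvar(x)
            # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I).
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
            recon = self.decode(z)
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
            neg_elbo = neg_elbo + F.mse_loss(recon, frames[:, step]) + kl
        return neg_elbo  # negative ELBO up to constants


model = ConvLSTMModel()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # quoted learning rate
batch = torch.randn(4, 10, 3, 10, 10)  # mini-batch size 4 as quoted; 10x10 grids assumed
loss = model(batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The 10x10 grid size, three input channels, and ten-step sequences are placeholder shapes for a Sokoban-like observation; they are not reported in the quoted excerpts.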