Rapid Task-Solving in Novel Environments

Authors: Samuel Ritter, Ryan Faulkner, Laurent Sartran, Adam Santoro, Matthew Botvinick, David Raposo

ICLR 2021

Reproducibility assessment (variable, result, and the LLM's supporting evidence from the paper):

Research Type: Experimental
Evidence: "We demonstrate that state-of-the-art deep RL agents fail at RTS in both domains, and that this failure is due to an inability to plan over gathered knowledge. We develop Episodic Planning Networks (EPNs) and show that deep-RL agents with EPNs excel at RTS, outperforming the nearest baseline by factors of 2-3 and learning to navigate held-out StreetLearn maps within a single episode." The response also quotes a figure caption: "Training curves. Performance measured by the average reward per episode, which corresponds to the average number of tasks completed within a 100-step episode (showing the best runs from a large hyper-parameter sweep for each model)."

Researcher Affiliation: Industry
Evidence: "Samuel Ritter, Ryan Faulkner, Laurent Sartran, Adam Santoro, Matthew Botvinick, David Raposo. DeepMind, London, UK. {ritters, rfaulk, lsartran, adamsantoro, botvinick, draposo}@google.com"

Pseudocode: No
Evidence: No structured pseudocode or algorithm blocks were found. The paper describes the architecture and update functions in text and mathematical notation, not in pseudocode form. (A speculative sketch of the planning step is given after this table.)

Open Source Code: No
Evidence: The paper contains no statement about releasing source code and no link to a code repository.

Open Datasets: Yes
Evidence: "We introduce the One-Shot StreetLearn domain (see Figure 1a), wherein environments are sampled as neighborhoods from the StreetLearn dataset of Google Street View images and their connectivity (Mirowski et al., 2019)."

Dataset Splits: No
Evidence: The paper mentions training and testing on held-out data (a held-out city and held-out neighborhoods) but provides no numerical splits (percentages or counts) for a validation set, nor does it describe a cross-validation setup.

Hardware Specification: No
Evidence: The paper states that "The distributed agent consisted of 1000 actors that produced trajectories of experience on CPU, and a single learner running on a Tensor Processing Unit (TPU)", but does not specify the CPU or TPU models or versions. (An illustrative actor/learner skeleton appears after this table.)

Software Dependencies: No
Evidence: The paper mentions using "IMPALA, a framework for distributed RL training" and the "RMSprop optimization algorithm", but provides no version numbers for any software dependencies.

Experiment Setup: Yes
Evidence: "Please see the table below for values of fixed hyperparameters and intervals used for hyperparameter tuning."

Hyperparameter       Value / interval
Agent:
  Mini-batch size    [32, 128]
  Unroll length      [10, 40]
  Entropy cost       [1e-3, 1e-2]
  Discount γ         [0.9, 0.95]
RMSprop:
  Learning rate      [1e-5, 4e-4]
  Epsilon ε          1e-4
  Momentum           0
  Decay              0.99

(A sketch for sampling one configuration from these intervals also follows the table.)

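On the Pseudocode row: since the paper describes the EPN's planning computation only in prose and equations, the following is a minimal sketch of an iterated self-attention step over episodic memory, in the spirit of that description. The memory layout, layer sizes, iteration count, and read-out here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class EpisodicPlanner(nn.Module):
    """Iterated self-attention over an episodic memory of transition
    embeddings, conditioned on the current observation/goal embedding."""

    def __init__(self, dim: int = 64, heads: int = 4, iterations: int = 3):
        super().__init__()
        self.iterations = iterations  # the same block is applied repeatedly
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, memory: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # memory: [batch, slots, dim] embeddings of transitions stored this episode
        # query:  [batch, dim] embedding of the current observation and goal
        x = torch.cat([memory, query.unsqueeze(1)], dim=1)
        for _ in range(self.iterations):      # shared weights across iterations
            attended, _ = self.attn(x, x, x)  # every slot attends to every slot
            x = x + self.mlp(attended)        # residual update of each slot
        return x[:, -1]                       # read out the slot holding the query

planner = EpisodicPlanner()
memory = torch.randn(2, 50, 64)          # 50 remembered transitions per episode
query = torch.randn(2, 64)
plan_embedding = planner(memory, query)  # would feed the policy/value heads

Whether the paper's update shares attention weights across iterations exactly as above, and how the output conditions the policy, are among the details this sketch does not settle.
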
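On the Hardware row: the quoted setup is the standard IMPALA actor/learner split, with many CPU actors generating trajectories for one accelerator-hosted learner. The skeleton below illustrates that data flow only; the queue, batch size, and stub functions (run_episode, update_parameters) are hypothetical placeholders, not the paper's code.

import multiprocessing as mp

def run_episode() -> list:
    # Placeholder rollout: a real actor would step the environment with the
    # current policy and return a trajectory of (observation, action, reward).
    return [("obs", 0, 0.0)]

def update_parameters(batch: list) -> None:
    # Placeholder learner step: a real learner would run a gradient update on
    # the accelerator and publish fresh policy weights back to the actors.
    pass

def actor(queue: mp.Queue) -> None:
    while True:
        queue.put(run_episode())  # ship experience to the learner

def learner(queue: mp.Queue, batch_size: int = 32) -> None:
    while True:
        update_parameters([queue.get() for _ in range(batch_size)])

if __name__ == "__main__":
    queue = mp.Queue(maxsize=10_000)
    actors = [mp.Process(target=actor, args=(queue,), daemon=True)
              for _ in range(8)]  # the paper used 1000 actors
    for p in actors:
        p.start()
    learner(queue)  # single learner process (run on a TPU in the paper)
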
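On the Experiment Setup row: the fixed values and search intervals above translate directly into a sweep configuration. The sketch below draws one configuration; the sampling distributions (log-uniform for rates and costs, uniform otherwise) are assumptions, since the paper does not state how the intervals were searched.

import math
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one configuration from the sweep intervals reported in the paper."""
    def log_uniform(low: float, high: float) -> float:
        return 10 ** rng.uniform(math.log10(low), math.log10(high))
    return {
        # agent hyperparameters
        "mini_batch_size": rng.randint(32, 128),   # interval [32, 128]
        "unroll_length": rng.randint(10, 40),      # interval [10, 40]
        "entropy_cost": log_uniform(1e-3, 1e-2),   # interval [1e-3, 1e-2]
        "discount": rng.uniform(0.9, 0.95),        # interval [0.9, 0.95]
        # RMSprop hyperparameters
        "learning_rate": log_uniform(1e-5, 4e-4),  # interval [1e-5, 4e-4]
        "epsilon": 1e-4,                           # fixed
        "momentum": 0.0,                           # fixed
        "decay": 0.99,                             # fixed
    }

config = sample_config(random.Random(0))

If one were reproducing this in PyTorch, the paper's "decay" corresponds to the alpha argument of torch.optim.RMSprop and "epsilon" to its eps argument.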