IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

Authors: Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach.
Researcher Affiliation | Industry | DeepMind Technologies, London, United Kingdom.
Pseudocode | No | The paper describes the V-trace actor-critic algorithm in prose within Section 4.2 but does not provide a formal pseudocode block or algorithm figure. (A numpy sketch of the V-trace target computation follows the table.)
Open Source Code | No | The paper does not explicitly state that the source code for the IMPALA methodology is openly available or provide a link to it. It only links to the DMLab environment itself (github.com/deepmind/lab).
Open Datasets | Yes | We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). A detailed description of DMLab-30 and the tasks are available at github.com/deepmind/lab and deepmind.com/dm-lab-30.
Dataset Splits | No | We perform hyperparameter sweeps over the weighting of entropy regularisation, the learning rate and the RMSProp epsilon. For each experiment we use an identical set of 24 pre-sampled hyperparameter combinations from the ranges in Appendix D.1. While the paper describes hyperparameter tuning, it does not provide specific train/validation/test dataset splits as percentages or sample counts for the data within the environments.
Hardware Specification | Yes | 1 Nvidia P100 (footnote to Table 1).
Software Dependencies | No | Finally, we also make use of several off the shelf optimisations available in TensorFlow (Abadi et al., 2017) such as preparing the next batch of data for the learner while still performing computation, compiling parts of the computational graph with XLA (a TensorFlow Just-In-Time compiler) and optimising the data format to get the maximum performance from the cuDNN framework (Chetlur et al., 2014). The paper mentions these software components but does not specify version numbers. (A TensorFlow sketch of these optimisations follows the table.)
Experiment Setup | Yes | We perform hyperparameter sweeps over the weighting of entropy regularisation, the learning rate and the RMSProp epsilon. For each experiment we use an identical set of 24 pre-sampled hyperparameter combinations from the ranges in Appendix D.1. The other hyperparameters were fixed to values specified in Appendix D.3. (A sketch of such a pre-sampled sweep follows the table.)
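Since the paper presents V-trace in equation form rather than as pseudocode, the following minimal numpy sketch shows how the V-trace value targets of Section 4.1 could be computed for a single trajectory. The function name, single-trajectory shapes and constant discount are illustrative assumptions, not the authors' released implementation.

import numpy as np

def vtrace_targets(behaviour_logp, target_logp, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    # Truncated importance weights from the paper:
    #   rho_t = min(rho_bar, pi(a_t|x_t) / mu(a_t|x_t)),  c_t = min(c_bar, pi/mu)
    rhos = np.exp(target_logp - behaviour_logp)
    clipped_rhos = np.minimum(rho_bar, rhos)
    cs = np.minimum(c_bar, rhos)

    # Temporal-difference terms: delta_t V = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t)).
    next_values = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * next_values - values)

    # Backward recursion: v_s = V(x_s) + delta_s V + gamma * c_s * (v_{s+1} - V(x_{s+1})).
    acc = 0.0
    vs_minus_v = np.zeros_like(values)
    for t in reversed(range(len(values))):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v

The paper's actor-critic update then uses rho_s * (r_s + gamma * v_{s+1} - V(x_s)) as the advantage in the policy gradient, with the targets v computed as above.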
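As a rough modern analogue of the TensorFlow optimisations quoted in the Software Dependencies row (the original implementation used TF1-era input pipelines), the sketch below uses tf.data prefetching to prepare the next batch while the learner is still computing, and XLA JIT compilation of the training step. The data, model and hyperparameter values are placeholders, not the paper's agent.

import tensorflow as tf

# Placeholder data standing in for actor-generated trajectories (illustrative only).
data = tf.random.normal([1024, 84])
dataset = (
    tf.data.Dataset.from_tensor_slices(data)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)   # prepare the next batch while the learner is busy
)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.build((None, 84))           # create variables before the XLA-compiled step
opt = tf.keras.optimizers.RMSprop(learning_rate=6e-4, epsilon=0.1)

@tf.function(jit_compile=True)    # compile this step with XLA
def loss_and_grads(batch):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(batch)))   # dummy loss for illustration
    return loss, tape.gradient(loss, model.trainable_variables)

for batch in dataset:
    loss, grads = loss_and_grads(batch)
    opt.apply_gradients(zip(grads, model.trainable_variables))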
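The sweep protocol described in the last two rows (one fixed set of 24 pre-sampled hyperparameter combinations reused across experiments) could be reproduced along the lines of the sketch below. The log-uniform ranges shown here are placeholders; the actual ranges are specified in Appendix D.1 of the paper.

import random

random.seed(0)  # fix the seed so every experiment sees the identical 24 combinations

def sample_combination():
    # Log-uniform sampling over illustrative ranges (Appendix D.1 gives the real ones).
    return {
        "entropy_cost":    10 ** random.uniform(-4, -2),
        "learning_rate":   10 ** random.uniform(-5, -3),
        "rmsprop_epsilon": 10 ** random.uniform(-7, -1),
    }

sweep = [sample_combination() for _ in range(24)]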