IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Authors: Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, Koray Kavukcuoglu
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). Our results show that IMPALA is able to achieve better performance than previous agents with less data, and crucially exhibits positive transfer between tasks as a result of its multi-task approach. |
| Researcher Affiliation | Industry | DeepMind Technologies, London, United Kingdom. |
| Pseudocode | No | The paper describes the V-trace actor-critic algorithm in prose within Section 4.2 but does not provide a formal pseudocode block or algorithm figure (an illustrative sketch of the V-trace target computation follows this table). |
| Open Source Code | No | The paper does not explicitly state that the source code for the IMPALA methodology is openly available or provide a link to it. It only links to the DMLab environment itself (github.com/deepmind/lab). |
| Open Datasets | Yes | We demonstrate the effectiveness of IMPALA for multi-task reinforcement learning on DMLab-30 (a set of 30 tasks from the DeepMind Lab environment (Beattie et al., 2016)) and Atari57 (all available Atari games in Arcade Learning Environment (Bellemare et al., 2013a)). A detailed description of DMLab-30 and the tasks are available at github.com/deepmind/lab and deepmind.com/dm-lab-30. |
| Dataset Splits | No | We perform hyperparameter sweeps over the weighting of entropy regularisation, the learning rate and the RMSProp epsilon. For each experiment we use an identical set of 24 pre-sampled hyperparameter combinations from the ranges in Appendix D.1. While the paper describes hyperparameter tuning, it does not provide specific train/validation/test dataset splits as percentages or sample counts for the data within the environments. |
| Hardware Specification | Yes | 1 Nvidia P100 (Footnote to Table 1) |
| Software Dependencies | No | Finally, we also make use of several off-the-shelf optimisations available in TensorFlow (Abadi et al., 2017) such as preparing the next batch of data for the learner while still performing computation, compiling parts of the computational graph with XLA (a TensorFlow Just-In-Time compiler) and optimising the data format to get the maximum performance from the cuDNN framework (Chetlur et al., 2014). The paper mentions these software components but does not specify version numbers. |
| Experiment Setup | Yes | We perform hyperparameter sweeps over the weighting of entropy regularisation, the learning rate and the RMSProp epsilon. For each experiment we use an identical set of 24 pre-sampled hyperparameter combinations from the ranges in Appendix D.1. The other hyperparameters were fixed to values specified in Appendix D.3. (An illustrative sketch of such a pre-sampled sweep also follows the table.) |
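
Since the paper gives the V-trace off-policy correction only in prose and equations (Section 4), the following is a minimal sketch of how the V-trace targets v_s could be computed for a single trajectory. It follows the published definition, with truncated importance ratios rho_t = min(rho_bar, pi(a_t|x_t) / mu(a_t|x_t)) and c_t = min(c_bar, pi(a_t|x_t) / mu(a_t|x_t)), but the function name, argument names and NumPy implementation are illustrative assumptions, not the authors' code.

```python
import numpy as np

def vtrace_targets(behaviour_log_probs, target_log_probs, rewards, values,
                   bootstrap_value, discount=0.99, rho_bar=1.0, c_bar=1.0):
    """Sketch of V-trace targets v_s for one trajectory of length T.

    All array arguments have length T; bootstrap_value is the learner's
    value estimate V(x_T) for the state following the trajectory.
    """
    T = len(rewards)
    # Truncated importance sampling ratios pi / mu.
    rhos = np.exp(np.asarray(target_log_probs) - np.asarray(behaviour_log_probs))
    clipped_rhos = np.minimum(rho_bar, rhos)  # rho_t: weights the TD error
    clipped_cs = np.minimum(c_bar, rhos)      # c_t: controls how far corrections propagate

    values = np.asarray(values, dtype=np.float64)
    next_values = np.append(values[1:], bootstrap_value)
    # delta_t V = rho_t * (r_t + gamma * V(x_{t+1}) - V(x_t))
    deltas = clipped_rhos * (np.asarray(rewards) + discount * next_values - values)

    # Backward recursion: v_s - V(x_s) = delta_s V + gamma * c_s * (v_{s+1} - V(x_{s+1}))
    vs_minus_v = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        acc = deltas[t] + discount * clipped_cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v
```

In the paper's actor-critic update these targets serve as regression targets for the value function, while the policy gradient uses the advantage rho_t * (r_t + gamma * v_{t+1} - V(x_t)).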
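
The experiment setup row quotes the sweep procedure but not the ranges themselves (those are listed in Appendix D.1 of the paper). Below is a minimal sketch of pre-sampling an identical set of 24 hyperparameter combinations; the log-uniform sampling and the bounds are placeholder assumptions, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so every experiment reuses the same set

def sample_combination():
    # Log-uniform sampling over illustrative placeholder bounds,
    # not the ranges from Appendix D.1.
    return {
        "entropy_cost": float(10 ** rng.uniform(-4, -2)),
        "learning_rate": float(10 ** rng.uniform(-5, -3)),
        "rmsprop_epsilon": float(10 ** rng.uniform(-7, -1)),
    }

# Pre-sample once; every experiment in the sweep evaluates this identical set.
HYPERPARAMETER_SWEEP = [sample_combination() for _ in range(24)]
```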