Multi-Task Deep Reinforcement Learning with PopArt

Authors: Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, Hado van Hasselt

AAAI 2019, pp. 3796-3803 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated our approach in two challenging multi-task benchmarks, Atari-57 and DmLab-30, based on Atari and DeepMind Lab respectively, and introduced by Espeholt et al. We also consider a new benchmark, consisting of the same 57 Atari games as Atari-57, but with the original unclipped reward scheme. We demonstrate state-of-the-art performance on all three benchmarks.
Researcher Affiliation | Industry | Matteo Hessel (DeepMind), Hubert Soyer (DeepMind), Lasse Espeholt (DeepMind), Wojciech Czarnecki (DeepMind), Simon Schmitt (DeepMind), Hado van Hasselt (DeepMind)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes the algorithms and updates using mathematical equations and descriptive text.
Open Source Code | Yes | Note that an efficient implementation of IMPALA is available open source (www.github.com/deepmind/scalable-agent), and that, while we use this agent for our experiments, our approach can be applied to other data-parallel multi-task agents (e.g. A3C).
Open Datasets | Yes | Atari-57 is a collection of 57 classic Atari 2600 games; the ALE (Bellemare et al. 2013) exposes them as RL environments. DmLab-30 is a benchmark consisting of 30 different visually rich, partially observable RL environments (Beattie et al. 2016).
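As a brief illustration of how the ALE exposes an Atari 2600 game as an RL environment, the following minimal random-agent loop may help. It is a sketch assuming the classic OpenAI Gym Atari wrappers (gym[atari]); the environment id, the (obs, reward, done, info) step API, and the random policy are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: one episode in one Atari-57 game via the classic Gym ALE
# wrappers. Environment id and step API are assumptions about the setup.
import gym

env = gym.make("BreakoutNoFrameskip-v4")   # one of the 57 Atari 2600 games
obs = env.reset()
episode_return = 0.0
done = False
while not done:
    action = env.action_space.sample()      # random policy, purely illustrative
    obs, reward, done, info = env.step(action)
    episode_return += reward                # unclipped reward, as in the new benchmark
env.close()
print("episode return:", episode_return)
```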
Dataset Splits | No | The paper mentions using Population-Based Training (PBT) to adapt hyperparameters, which involves a form of internal validation, and refers to 'train' and 'test' aggregate scores for DmLab-30. However, it does not explicitly provide specific details (percentages, counts) for a separate validation dataset split.
Hardware Specification | No | The paper mentions using 'A single GPU learner' and running 'on a cloud service' with 'large CPU requirements', but does not specify exact GPU models, CPU models, or other detailed hardware specifications.
Software Dependencies | No | The paper states 'We implemented all agents in TensorFlow' but does not provide a specific version number for TensorFlow or any other software dependencies.
Experiment Setup | Yes | For each batch of rollouts processed by the learner, we averaged the G_v^t targets within a rollout, and for each rollout in the batch we performed one online update of PopArt's normalisation statistics with decay β = 3×10⁻⁴. Note that β didn't require any tuning. To prevent numerical issues, we clipped the scale σ to the range [0.0001, 1e6]. We did not backpropagate gradients into µ and σ, which were exclusively updated as in Equation 6. The weights W of the last layer of the value function were updated according to Equations 13 and 11. Note that we first applied the actor-critic updates (11), then updated the statistics (6), and finally applied the output-preserving updates (13). In all experiments we used population-based training (PBT) to adapt hyperparameters during the course of training (Jaderberg et al. 2017). As in the IMPALA paper, we used PBT to tune the learning rate, entropy cost, the optimiser's epsilon, and, in the Atari experiments, the max gradient norm. In Atari-57 we used populations of 24 instances, in DmLab-30 just 8 instances. For other hyperparameters we used the values from (Espeholt et al. 2018).
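To make the update ordering above concrete, here is a small NumPy sketch of per-task PopArt bookkeeping: an online update of the statistics with decay β and clipping of the scale σ, followed by the output-preserving rescaling of the last value-head layer. Class and function names (PopArtStats, update_stats, preserve_outputs) are illustrative and not taken from the paper or the released IMPALA code; in the actual agent these steps run inside the TensorFlow learner.

```python
# A minimal sketch of per-task PopArt statistics and the output-preserving
# rescale of the last value-head layer. Names and shapes are assumptions.
import numpy as np

class PopArtStats:
    def __init__(self, num_tasks, beta=3e-4, sigma_min=1e-4, sigma_max=1e6):
        self.beta = beta
        self.sigma_min, self.sigma_max = sigma_min, sigma_max
        self.mu = np.zeros(num_tasks)      # first moment of returns, per task
        self.nu = np.ones(num_tasks)       # second moment of returns, per task
        self.sigma = np.ones(num_tasks)    # scale derived from mu and nu

    def update_stats(self, task_id, target):
        """One online update of the normalisation statistics with decay beta."""
        b = self.beta
        self.mu[task_id] = (1 - b) * self.mu[task_id] + b * target
        self.nu[task_id] = (1 - b) * self.nu[task_id] + b * target ** 2
        # Clip the scale to a fixed range to avoid numerical issues.
        self.sigma[task_id] = np.clip(
            np.sqrt(max(self.nu[task_id] - self.mu[task_id] ** 2, 0.0)),
            self.sigma_min, self.sigma_max)

def preserve_outputs(W, b, old_mu, old_sigma, new_mu, new_sigma):
    """Rescale the last value-head layer so un-normalised value predictions
    are unchanged after the statistics move from (old_mu, old_sigma) to
    (new_mu, new_sigma). W has shape [num_tasks, features], b shape [num_tasks]."""
    W_new = W * (old_sigma / new_sigma)[:, None]
    b_new = (old_sigma * b + old_mu - new_mu) / new_sigma
    return W_new, b_new
```

In line with the quoted setup, one would first apply the actor-critic update on the normalised, rollout-averaged targets, then call update_stats for each rollout in the batch, and finally call preserve_outputs using the statistics snapshotted before and after those updates, without backpropagating gradients into µ or σ.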