Multi-Task Deep Reinforcement Learning with PopArt
Authors: Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, Hado van Hasselt (pp. 3796-3803)
AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluated our approach in two challenging multi-task benchmarks, Atari-57 and DmLab-30, based on Atari and DeepMind Lab respectively, and introduced by Espeholt et al. We also consider a new benchmark, consisting of the same 57 Atari games as Atari-57, but with the original unclipped reward scheme. We demonstrate state-of-the-art performance on all three benchmarks. |
| Researcher Affiliation | Industry | Matteo Hessel (DeepMind), Hubert Soyer (DeepMind), Lasse Espeholt (DeepMind), Wojciech Czarnecki (DeepMind), Simon Schmitt (DeepMind), Hado van Hasselt (DeepMind) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. It describes the algorithms and updates using mathematical equations and descriptive text. |
| Open Source Code | Yes | Note that an efficient implementation of IMPALA is available open-source (www.github.com/deepmind/scalable-agent), and that, while we use this agent for our experiments, our approach can be applied to other data-parallel multi-task agents (e.g. A3C). |
| Open Datasets | Yes | Atari-57 is a collection of 57 classic Atari 2600 games. The ALE (Bellemare et al. 2013) exposes them as RL environments. DmLab-30 is a benchmark consisting of 30 different visually rich, partially observable RL environments (Beattie et al. 2016). |
| Dataset Splits | No | The paper mentions using Population-Based Training (PBT) to adapt hyperparameters, which involves a form of internal validation, and refers to 'train' and 'test' aggregate scores for DmLab-30. However, it does not provide specific details (percentages, counts) for a separate validation dataset split. |
| Hardware Specification | No | The paper mentions using 'A single GPU learner' and running 'on a cloud service' with 'large CPU requirements' but does not specify any exact GPU models, CPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper states 'We implemented all agents in TensorFlow' but does not provide a specific version number for TensorFlow or any other software dependencies. |
| Experiment Setup | Yes | For each batch of rollouts processed by the learner, we averaged the G^v_t targets within a rollout, and for each rollout in the batch we performed one online update of PopArt's normalisation statistics with decay β = 3×10⁻⁴. Note that β didn't require any tuning. To prevent numerical issues, we clipped the scale σ to the range [0.0001, 1e6]. We did not backpropagate gradients into µ and σ, which were updated exclusively as in Equation 6. The weights W of the last layer of the value function were updated according to Equations 13 and 11. Note that we first applied the actor-critic updates (11), then updated the statistics (6), and finally applied the output-preserving updates (13). In all experiments we used population-based training (PBT) to adapt hyperparameters during the course of training (Jaderberg et al. 2017). As in the IMPALA paper, we used PBT to tune the learning rate, entropy cost, the optimiser's epsilon, and, in the Atari experiments, the max gradient norm. In Atari-57 we used populations of 24 instances; in DmLab-30 just 8 instances. For other hyperparameters we used the values from Espeholt et al. (2018). A minimal sketch of these updates follows the table. |
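
The experiment-setup excerpt above describes two coupled updates per rollout: an exponential-moving-average update of PopArt's per-task statistics µ and σ (Equation 6), and an output-preserving rescaling of the value head's last-layer weights (Equation 13). Below is a minimal NumPy sketch of those two steps. Only β = 3×10⁻⁴ and the σ clipping range [0.0001, 1e6] come from the paper; the class name, method signatures, and per-task indexing are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


class PopArtStats:
    """Sketch of per-task PopArt statistics with output-preserving rescaling.

    Assumes one scalar value output per task (as in multi-task PopArt);
    `beta`, `sigma_min`, `sigma_max` defaults follow the excerpt above.
    """

    def __init__(self, num_tasks, beta=3e-4, sigma_min=1e-4, sigma_max=1e6):
        self.beta = beta
        self.sigma_min, self.sigma_max = sigma_min, sigma_max
        # First and second moments of the unnormalised value targets, per task.
        self.nu1 = np.zeros(num_tasks)
        self.nu2 = np.ones(num_tasks)

    @property
    def mu(self):
        return self.nu1

    @property
    def sigma(self):
        var = np.maximum(self.nu2 - self.nu1 ** 2, 0.0)
        # Clip the scale to avoid numerical issues, as described in the paper.
        return np.clip(np.sqrt(var), self.sigma_min, self.sigma_max)

    def update(self, task_id, target, w, b):
        """One online statistics update for `task_id`, followed by the
        output-preserving rescaling of that task's last-layer weights `w`
        and bias `b`. `target` would be the rollout-averaged G^v_t."""
        mu_old, sigma_old = self.mu[task_id], self.sigma[task_id]
        # Moving-average moment updates (no gradients flow into mu / sigma).
        self.nu1[task_id] = (1 - self.beta) * self.nu1[task_id] + self.beta * target
        self.nu2[task_id] = (1 - self.beta) * self.nu2[task_id] + self.beta * target ** 2
        mu_new, sigma_new = self.mu[task_id], self.sigma[task_id]
        # Rescale weights and bias so unnormalised value predictions are unchanged.
        w_new = w * sigma_old / sigma_new
        b_new = (sigma_old * b + mu_old - mu_new) / sigma_new
        return w_new, b_new
```

In this sketch, the actor-critic gradient step on the normalised value head would run first, then `update` would be called once per rollout in the batch, matching the ordering described in the excerpt (updates 11, then 6, then 13).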